1. Introduction



When dealing with a high dimensional observation vector, the natural question arises 
whether the data generating process can be approximated by a model of substantially lower 
dimension. Rather than on the true model, the focus is here on smaller ones which still 
contain the essential information and allow for interpretation. Typically, the models under 
consideration are characterized by the non-zero components of some parameter vector. 
Estimating the true model requires the rather idealistic situation that each component
either equals zero or has sufficiently large modulus: A tiny perturbation of the parameter vector
may result in the biggest model, so the question about the true model does not seem to be 
adequate in general. Alternatively, the model which is optimal in terms of risk appears as 
a target of many model selection strategies. Within a specified class of competing models, 
this paper is concerned with confidence regions for those approximating models which are 
optimal in terms of risk. 

Suppose that we observe a random vector $X_n = (X_{in})_{i=1}^n$ with distribution $\mathcal{N}_n(\theta_n, \sigma^2 I_n)$
together with an estimator $\hat\sigma_n$ for the standard deviation $\sigma > 0$. Often the signal $\theta_n$
represents coefficients of an unknown smooth function with respect to a given orthonormal
basis of functions.

There is a vast amount of literature on point estimation of $\theta_n$. For a given estimator
$\hat\theta_n = \hat\theta_n(X_n, \hat\sigma_n)$ for $\theta_n$, let

$$L(\hat\theta_n, \theta_n) := \|\hat\theta_n - \theta_n\|^2 \quad\text{and}\quad R(\hat\theta_n, \theta_n) := \mathbb{E}\, L(\hat\theta_n, \theta_n)$$

be its quadratic loss and the corresponding risk, respectively. Here $\|\cdot\|$ denotes the standard
Euclidean norm of vectors. Various adaptivity results are known for this setting, often in
terms of oracle inequalities. A typical result reads as follows: Let $(\hat\theta_n^{(c)})_{c \in \mathcal{C}_n}$ be a family
of candidate estimators $\hat\theta_n^{(c)} = \hat\theta_n^{(c)}(X_n)$ for $\theta_n$, where $\sigma > 0$ is temporarily assumed to be
known. Then there exist estimators $\hat\theta_n$ and constants $A_n, B_n = O(\log(n)^\gamma)$ with $\gamma \ge 0$
such that for arbitrary $\theta_n$ in a certain set $\Theta_n \subset \mathbb{R}^n$,

$$R(\hat\theta_n, \theta_n) \le A_n \inf_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) + B_n \sigma^2.$$

Results of this type are provided, for instance, by Polyak and Tsybakov (1991) and Donoho 
and Johnstone (1994, 1995, 1998), in the framework of Gaussian model selection by Birgé
and Massart (2001). The latter article copes in particular with the fact that a model is
not necessarily true. Further results of this type, partly in different settings, have been
provided by Stone (1984), Lepski et al. (1997), Efromovich (1998), Cai (1999, 2002), to 
mention just a few. 

By way of contrast, when aiming at adaptive confidence sets one faces severe limitations.
Here is a result of Li (1989), slightly rephrased: Suppose that $\Theta_n$ contains a closed
Euclidean ball $B(\theta_n^o, c n^{1/4})$ around some vector $\theta_n^o \in \mathbb{R}^n$ with radius $c n^{1/4} > 0$. Still
assuming $\sigma$ to be known, let $\hat D_n = \hat D_n(X_n) \subset \mathbb{R}^n$ be a $(1-\alpha)$-confidence set for $\theta_n \in \Theta_n$.
Such a confidence set may be used as a test of the (Bayesian) null hypothesis that $\theta_n$ is
uniformly distributed on the sphere $\partial B(\theta_n^o, c n^{1/4})$ versus the alternative that $\theta_n = \theta_n^o$: We
reject this null hypothesis at level $\alpha$ if $\|\eta - \theta_n^o\| < c n^{1/4}$ for all $\eta \in \hat D_n$. Since this test
cannot have larger power than the corresponding Neyman-Pearson test,

$$P_{\theta_n^o}\Bigl(\sup_{\eta \in \hat D_n} \|\eta - \theta_n^o\| < c n^{1/4}\Bigr) \le P\bigl(S_n^2 \le \chi^2_{n;\alpha}(c^2 n^{1/2}/\sigma^2)\bigr) \qquad (S_n^2 \sim \chi_n^2)$$

$$= \Phi\bigl(\Phi^{-1}(\alpha) + 2^{-1/2} c^2/\sigma^2\bigr) + o(1),$$

where $\chi^2_{n;\alpha}(\delta^2)$ stands for the $\alpha$-quantile of the noncentral chi-squared distribution with
$n$ degrees of freedom and noncentrality parameter $\delta^2$. Throughout this paper, asymptotic
statements refer to $n \to \infty$. The previous inequality entails that no reasonable confidence
set has a diameter of order $o_p(n^{1/4})$ uniformly over the parameter space $\Theta_n$, as long as the
latter is sufficiently large. Despite these limitations, there is some literature on confidence
sets in the present or similar settings; see for instance Beran (1996, 2000), Beran and
Dümbgen (1998) and Genovese and Wasserman (2005).
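
To get a feeling for the limiting bound $\Phi(\Phi^{-1}(\alpha) + 2^{-1/2}c^2/\sigma^2)$ in the display above, the following minimal Python sketch evaluates it for a few radii constants $c$. The function name `li_power_limit` and the chosen values of $\alpha$, $c$ and $\sigma$ are illustrative assumptions and not part of the original text; the $o(1)$ term is ignored.

```python
from statistics import NormalDist


def li_power_limit(alpha: float, c: float, sigma: float = 1.0) -> float:
    """Evaluate Phi(Phi^{-1}(alpha) + c^2 / (sqrt(2) * sigma^2)).

    This is the limit of the upper bound (its o(1) term dropped) for the
    probability that all points of the confidence set lie within
    c * n^(1/4) of the true signal.
    """
    std_normal = NormalDist()  # standard normal distribution N(0, 1)
    shift = c ** 2 / (2 ** 0.5 * sigma ** 2)
    return std_normal.cdf(std_normal.inv_cdf(alpha) + shift)


if __name__ == "__main__":
    # Illustrative values only: alpha = 0.05 and sigma = 1.
    for c in (0.0, 0.5, 1.0, 2.0):
        print(f"c = {c:3.1f}: limiting bound = {li_power_limit(0.05, c):.4f}")
```

For $c = 0$ the bound reduces to $\alpha$, and it stays close to $\alpha$ for small $c$, which is the quantitative content of the statement that the diameter cannot be of smaller order than $n^{1/4}$.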

Improving the rate of $O_p(n^{1/4})$ is only possible via additional constraints on $\theta_n$, i.e.
considering substantially smaller sets $\Theta_n$. For instance, Baraud (2004) developed nonasymptotic
confidence regions which perform well on finitely many linear subspaces. Robins and
van der Vaart (2006) construct confidence balls via sample splitting which adapt to some
extent to the unknown "smoothness" of $\theta_n$. In their context, $\Theta_n$ corresponds to a Sobolev
smoothness class with given parameter $(\beta, L)$. However, adaptation in this context is possible
only within a range $[\beta, 2\beta]$. Independently, Cai and Low (2006) treat the same problem
in the special case of the Gaussian white noise model, obtaining the same kind of adaptivity
in the broader scale of Besov bodies. Other possible constraints on $\theta_n$ are so-called
shape constraints; see for instance Cai and Low (2007), Dümbgen (2003) or Hengartner
and Stark (1995).

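The question is whether one can bridge this gap between confidence sets and point estimators.
More precisely, we would like to understand the possibility of adaptation for point
estimators in terms of some confidence region for the set of all optimal candidate estimators
$\hat\theta_n^{(c)}$. That means, we want to construct a confidence region
$\hat{\mathcal{K}}_{n,\alpha} = \hat{\mathcal{K}}_{n,\alpha}(X_n, \hat\sigma_n) \subset \mathcal{C}_n$ for the set

$$\mathcal{K}_n(\theta_n) := \operatorname*{argmin}_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n)
= \bigl\{c \in \mathcal{C}_n : R(\hat\theta_n^{(c)}, \theta_n) \le R(\hat\theta_n^{(c')}, \theta_n) \text{ for all } c' \in \mathcal{C}_n\bigr\}$$

such that for arbitrary $\theta_n \in \mathbb{R}^n$,

$$P_{\theta_n}\bigl(\mathcal{K}_n(\theta_n) \subset \hat{\mathcal{K}}_{n,\alpha}\bigr) \ge 1 - \alpha \tag{1}$$

and

$$\left.\begin{array}{c} \max_{c \in \hat{\mathcal{K}}_{n,\alpha}} R(\hat\theta_n^{(c)}, \theta_n) \\[4pt] \max_{c \in \hat{\mathcal{K}}_{n,\alpha}} L(\hat\theta_n^{(c)}, \theta_n) \end{array}\right\}
= O_p(A_n) \min_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) + O_p(B_n)\sigma^2. \tag{2}$$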

Solving this problem means that statistical inference about differences in the performance 
of estimators is possible, although inference about their risk and loss is severely limited. 
In some settings, selecting estimators out of a class of competing estimators entails
estimating implicitly an unknown regularity or smoothness class for the underlying signal $\theta_n$.
Computing a confidence region for good estimators is particularly suitable in situations 
in which several good candidate estimators fit the data equally well although they look 
different. This aspect of exploring various candidate estimators is not covered by the usual 
theory of point estimation. 

Note that our confidence region $\hat{\mathcal{K}}_{n,\alpha}$ is required to contain the whole set $\mathcal{K}_n(\theta_n)$, not
just one element of it, with probability at least $1 - \alpha$. The same requirement is used by
Futschik (1999) for inference about the argmax of a regression function.

The remainder of this paper is organized as follows. For the reader's convenience our
approach is first described in a simple toy model in Section 2. In Section 3 we develop and
analyze an explicit confidence region $\hat{\mathcal{K}}_{n,\alpha}$ related to $\mathcal{C}_n := \{0, 1, \ldots, n\}$ with candidate
estimators

$$\hat\theta_n^{(k)} := (1\{i \le k\} X_{in})_{i=1}^n.$$

These correspond to a standard nested sequence of approximating models. Section 4
discusses richer families of candidate estimators.

All proofs and auxiliary results are deferred to Sections 5 and 6.

2. A toy problem

Suppose we observe a stochastic process $Y = (Y(t))_{t \in [0,1]}$, where

$$Y(t) = F(t) + W(t), \quad t \in [0,1],$$

with an unknown fixed continuous function $F$ on $[0,1]$ and a Brownian motion
$W = (W(t))_{t \in [0,1]}$. We are interested in the set

$$S(F) := \operatorname*{argmin}_{t \in [0,1]} F(t).$$

Precisely, we want to construct a $(1-\alpha)$-confidence region $\hat S_\alpha = \hat S_\alpha(Y) \subset [0,1]$ for $S(F)$
in the sense that

$$P\bigl(S(F) \subset \hat S_\alpha\bigr) \ge 1 - \alpha, \tag{3}$$

regardless of $F$. To construct such a confidence set we regard $Y(s) - Y(t)$ for arbitrary
different $s, t \in [0,1]$ as a test statistic for the null hypothesis that $s \in S(F)$, i.e. large
values of $Y(s) - Y(t)$ give evidence for $s \notin S(F)$.
A first naive proposal is the set

$$\hat S_\alpha^{\rm naive} := \bigl\{s \in [0,1] : Y(s) \le \min_{[0,1]} Y + \kappa_\alpha^{\rm naive}\bigr\}$$

with $\kappa_\alpha^{\rm naive}$ denoting the $(1-\alpha)$-quantile of $\max_{[0,1]} W - \min_{[0,1]} W$.

Here is a refined version based on results of Dümbgen and Spokoiny (2001): Let $\kappa_\alpha$ be
the $(1-\alpha)$-quantile of

$$\sup_{s,t \in [0,1]} \Bigl(\frac{|W(s) - W(t)|}{\sqrt{|s-t|}} - \sqrt{2\log(e/|s-t|)}\Bigr). \tag{4}$$

Then constraint (3) is satisfied by the confidence region $\hat S_\alpha$ which consists of all $s \in [0,1]$
such that

$$Y(s) \le Y(t) + \sqrt{|s-t|}\bigl(\sqrt{2\log(e/|s-t|)} + \kappa_\alpha\bigr) \quad\text{for all } t \in [0,1].$$
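
The two constructions above can be compared on a finite grid. The following Python sketch is only an illustration under stated assumptions: the process is observed at grid points, the critical values `kappa` and `kappa_naive` are supplied by the caller (in practice they would have to be approximated, e.g. by simulation of $W$), and the function names are hypothetical.

```python
import math


def multiscale_region(ts, ys, kappa):
    """Grid version of the refined region: keep index i if
    Y(s) <= Y(t) + sqrt|s-t| * (sqrt(2*log(e/|s-t|)) + kappa) for all grid points t != s.
    Assumes all grid points lie in [0, 1], so that |s-t| <= 1."""
    region = []
    for i, (s, y_s) in enumerate(zip(ts, ys)):
        keep = True
        for t, y_t in zip(ts, ys):
            d = abs(s - t)
            if d == 0.0:
                continue  # the constraint is only checked for t different from s
            slack = math.sqrt(d) * (math.sqrt(2.0 * math.log(math.e / d)) + kappa)
            if y_s > y_t + slack:
                keep = False
                break
        if keep:
            region.append(i)
    return region


def naive_region(ys, kappa_naive):
    """Grid version of the naive region {s : Y(s) <= min Y + kappa_naive}."""
    y_min = min(ys)
    return [i for i, y in enumerate(ys) if y <= y_min + kappa_naive]


if __name__ == "__main__":
    # Noise-free illustration (W = 0) with hypothetical critical values.
    ts = [i / 100 for i in range(101)]
    ys = [100.0 * abs(t - 0.5) for t in ts]
    print("multiscale:", multiscale_region(ts, ys, kappa=1.0))
    print("naive:     ", naive_region(ys, kappa_naive=2.5))
```

In this noise-free example the multiscale region shrinks to the single grid point at the minimizer, whereas the naive region keeps a whole neighbourhood whose width is governed by `kappa_naive` alone.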

To illustrate the power of this method, consider for instance a sequence of functions
$F = F_n = c_n F_o$ with positive constants $c_n \to \infty$ and a fixed continuous function $F_o$ with
unique minimizer $s_o$. Suppose that

$$\lim_{t \to s_o} \frac{F_o(t) - F_o(s_o)}{|t - s_o|^\gamma} = 1$$

for some $\gamma > 1/2$. Then the naive confidence region satisfies only

$$\max_{t \in \hat S_\alpha^{\rm naive}} |t - s_o| = O_p\bigl(c_n^{-1/\gamma}\bigr), \tag{5}$$

whereas

$$\max_{t \in \hat S_\alpha} |t - s_o| = O_p\bigl(\log(c_n)^{1/(2\gamma-1)} c_n^{-2/(2\gamma-1)}\bigr). \tag{6}$$

3. Confidence regions for nested approximating models

As in the introduction let $X_n = \theta_n + \varepsilon_n$ denote the $n$-dimensional observation vector with
$\theta_n \in \mathbb{R}^n$ and $\varepsilon_n \sim \mathcal{N}_n(0, \sigma^2 I_n)$. For any candidate estimator
$\hat\theta_n^{(k)} = (1\{i \le k\} X_{in})_{i=1}^n$ the loss is given by

$$L_n(k) := L(\hat\theta_n^{(k)}, \theta_n) = \sum_{i=k+1}^n \theta_{in}^2 + \sum_{i=1}^k (X_{in} - \theta_{in})^2$$

with corresponding risk

$$R_n(k) := R(\hat\theta_n^{(k)}, \theta_n) = \sum_{i=k+1}^n \theta_{in}^2 + k\sigma^2.$$

Model selection usually aims at estimating a candidate estimator which is optimal in
terms of risk. Since the risk depends on the unknown signal and therefore is not available,
the selection procedure minimizes an unbiased risk estimator instead. In the sequel, the
bias-corrected risk estimator for the candidate $\hat\theta_n^{(k)}$ is defined as

$$\hat R_n(k) := \sum_{i=k+1}^n (X_{in}^2 - \hat\sigma_n^2) + k\hat\sigma_n^2,$$

where $\hat\sigma_n^2$ is a variance estimator satisfying the subsequent condition.
(A) $\hat\sigma_n^2$ and $X_n$ are stochastically independent with

$$\frac{m\hat\sigma_n^2}{\sigma^2} \sim \chi_m^2,$$

where $1 \le m = m_n \le \infty$ with $m = \infty$ meaning that $\sigma$ is known, i.e. $\hat\sigma_n^2 = \sigma^2$. For
asymptotic statements, it is generally assumed that

$$\beta_n^2 := \frac{2n}{m_n} = O(1)$$

unless stated otherwise.
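
As a small illustration of the bias-corrected risk estimator $\hat R_n(k)$ defined above, the following Python sketch computes $\hat R_n(0), \ldots, \hat R_n(n)$ in one backward pass. The function name `risk_estimates` and the toy data are assumptions made for illustration; the variance estimate is simply passed in as a number.

```python
def risk_estimates(x, sigma2_hat):
    """Return [R_hat(0), ..., R_hat(n)] where
    R_hat(k) = sum_{i > k} (x_i^2 - sigma2_hat) + k * sigma2_hat.
    Here x[k] (0-based) plays the role of X_{k+1,n}."""
    n = len(x)
    estimates = [0.0] * (n + 1)
    estimates[n] = n * sigma2_hat  # empty tail sum for k = n
    tail = 0.0
    for k in range(n - 1, -1, -1):
        tail += x[k] ** 2 - sigma2_hat  # adds the term with index i = k + 1
        estimates[k] = tail + k * sigma2_hat
    return estimates


if __name__ == "__main__":
    # Toy observation vector with two large leading coefficients.
    r_hat = risk_estimates([3.0, 2.0, 0.5, -0.1], sigma2_hat=1.0)
    best_k = min(range(len(r_hat)), key=r_hat.__getitem__)
    print(r_hat, "minimized at k =", best_k)
```

Minimizing `r_hat` over `k` is the usual point-estimation step; the confidence regions of this section instead collect all indices whose estimated risk is not significantly larger than that of any competitor.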
Example. Suppose that we observe $Y = M\eta + \delta$ with given design matrix
$M \in \mathbb{R}^{(n+m) \times n}$ of rank $n$, unknown parameter vector $\eta \in \mathbb{R}^n$ and unobserved error vector
$\delta \sim \mathcal{N}_{n+m}(0, \sigma^2 I_{n+m})$. Then the previous assumptions are satisfied by $X_n := (M^\top M)^{1/2}\hat\eta$
with $\hat\eta := (M^\top M)^{-1} M^\top Y$ and $\hat\sigma_n^2 := \|Y - M\hat\eta\|^2/m$, where $\theta_n := (M^\top M)^{1/2}\eta$.

Important for our analysis is the behavior of the centered and rescaled difference process
$\hat D_n = (\hat D_n(j,k))_{0 \le j < k \le n}$ with

$$\hat D_n(j,k) := \frac{\hat R_n(j) - \hat R_n(k) - R_n(j) + R_n(k)}{\hat\sigma_n^2 (4\|\theta_n/\sigma\|^2 + 2n)^{1/2}}
= \frac{\sum_{i=j+1}^k (X_{in}^2 - \sigma^2 - \theta_{in}^2) - 2(k-j)(\hat\sigma_n^2 - \sigma^2)}{\hat\sigma_n^2 (4\|\theta_n/\sigma\|^2 + 2n)^{1/2}}.$$

One may also write $\hat D_n(j,k) = (\hat\sigma_n/\sigma)^{-2}\bigl(D_n(j,k) + V_n(j,k)\bigr)$ with

$$D_n(j,k) := (4\|\theta_n/\sigma\|^2 + 2n)^{-1/2} \sum_{i=j+1}^k \bigl(2(\theta_{in}/\sigma)(\varepsilon_{in}/\sigma) + (\varepsilon_{in}/\sigma)^2 - 1\bigr), \tag{7}$$

$$V_n(j,k) := (4\|\theta_n/\sigma\|^2 + 2n)^{-1/2}\, 2(k-j)(1 - \hat\sigma_n^2/\sigma^2). \tag{8}$$

This representation shows that the distribution of $\hat D_n$ depends on the degrees of freedom,
$m$, and the unknown "signal-to-noise vector" $\theta_n/\sigma$. The process $D_n$ consists of partial
sums of the independent, but in general non-identically distributed random variables
$2(\theta_{in}/\sigma)(\varepsilon_{in}/\sigma) + (\varepsilon_{in}/\sigma)^2 - 1$. The standard deviation of $D_n(j,k)$ is given by

$$\tau_n(j,k) := (4\|\theta_n/\sigma\|^2 + 2n)^{-1/2} \Bigl(\sum_{i=j+1}^k (4\theta_{in}^2/\sigma^2 + 2)\Bigr)^{1/2}.$$

Note that $\tau_n(0,n) = 1$ by construction. To imitate the more powerful confidence region of
Section 2 based on the multiscale approach, one needs a refined analysis of the increment
process $\hat D_n$. Since this process does not have subgaussian tails, the standardization is more
involved than the correction in (4).

Theorem 1. Define $\Gamma_n(j,k) := \bigl(2\log(e/\tau_n(j,k)^2)\bigr)^{1/2}$ for $0 \le j < k \le n$. Then

$$\sup_{0 \le j < k \le n} \frac{|\hat D_n(j,k)|}{\tau_n(j,k)} \le \sqrt{32}\,\log n + O_p(1),$$

and for any fixed $c > 2$,

$$\hat d_n := \sup_{0 \le j < k \le n} \Bigl(\frac{|\hat D_n(j,k)|}{\tau_n(j,k)} - \Gamma_n(j,k) - \frac{c \cdot \Gamma_n(j,k)^2}{(4\|\theta_n/\sigma\|^2 + 2n)^{1/2}\,\tau_n(j,k)}\Bigr)^+$$

is bounded in probability. In case of $\|\theta_n\|^2 = O(n)$, $\mathcal{L}(\hat d_n)$ is weakly approximated by the
law of

$$\delta_n := \sup_{0 \le j < k \le n} \Bigl(\frac{|\Delta_n(j,k)|}{\tau_n(j,k)} - \Gamma_n(j,k)\Bigr)^+,$$

where

$$\Delta_n(j,k) := W(\tau_n(0,k)^2) - W(\tau_n(0,j)^2) + \frac{2\beta_n(k-j)}{\sqrt{n}\,(4\|\theta_n/\sigma\|^2 + 2n)^{1/2}}\, Z$$

with a standard Brownian motion $W$ and a random variable $Z \sim \mathcal{N}(0,1)$ independent of
$W$.

The limiting distribution indicates that the additive correction term in the definition of
$\hat d_n$ cannot be chosen essentially smaller. It will play a crucial role for the efficiency of the
confidence region.

To construct a confidence set for $\mathcal{K}_n(\theta_n)$ by means of $\hat d_n$, we are facing the problem that
the auxiliary function $\tau_n(\cdot,\cdot)$ depends on the unknown signal-to-noise vector $\theta_n/\sigma$. In fact,
knowing $\tau_n$ would imply knowledge of $\mathcal{K}_n(\theta_n)$ already. A natural approach is to replace
the quantities which are dependent on the unknown parameter by suitable estimates. A
common estimator of the variance $\tau_n(j,k)^2$, $j < k$, is given by

$$\hat\tau_n(j,k)^2 := \Bigl\{\sum_{i=1}^n \bigl(4(X_{in}^2/\hat\sigma_n^2 - 1) + 2\bigr)\Bigr\}^{-1} \sum_{i=j+1}^k \bigl(4(X_{in}^2/\hat\sigma_n^2 - 1) + 2\bigr).$$

However, using such an estimator does not seem to work, in view of the behaviour of the ratio

$$\sup_{0 \le j < k \le n} \frac{\hat\tau_n(j,k)}{\tau_n(j,k)}$$

as $n$ goes to infinity. This can be verified by noting that the (rescaled) numerator of
$(\hat\tau_n(j,k)^2)_{0 \le j < k \le n}$ is, up to centering, essentially of the same structure as the rescaled
difference process $\hat D_n$ itself.



The least favourable case of constant risk 

The problem of estimating the set $\operatorname{argmin}_k R_n(k)$ can be cast into our toy model where
$Y(t)$, $F(t)$ and $W(t)$ correspond to $\hat R_n(k)$, $R_n(k)$ and the difference $\hat R_n(k) - R_n(k)$,
respectively. One may expect that the more distinctive the global minima are, the easier it is
to identify their location. Hence the case of constant risks appears to be least favourable,
corresponding to a signal

$$\theta_n^* := (\pm\sigma)_{i=1}^n.$$

In this situation, each candidate estimator $\hat\theta_n^{(k)}$ has the same risk of $n\sigma^2$.

A related consideration leading to an explicit procedure is as follows: For fixed indices
$0 \le j < k \le n$,

$$R_n(j) - R_n(k) = \sum_{i=j+1}^k \theta_{in}^2 - (k-j)\sigma^2,$$

and if Assumption (A) is satisfied, the statistic

$$T_{jkn} := \frac{\sum_{i=j+1}^k X_{in}^2}{(k-j)\hat\sigma_n^2} = 2 - \frac{\hat R_n(k) - \hat R_n(j)}{(k-j)\hat\sigma_n^2}$$

has a noncentral (in the numerator) $F$-distribution with $k - j$ and $m$ degrees of freedom
and noncentrality parameter $\sum_{i=j+1}^k \theta_{in}^2/\sigma^2$. Thus large or small values of $T_{jkn}$ give evidence for
$R_n(j)$ being larger or smaller, respectively, than $R_n(k)$. Precisely,

$$\mathcal{L}_{\theta_n}(T_{jkn}) \begin{cases} \le_{\rm st.} \mathcal{L}_{\theta_n^*}(T_{jkn}) & \text{whenever } j \in \mathcal{K}_n(\theta_n), \\ \ge_{\rm st.} \mathcal{L}_{\theta_n^*}(T_{jkn}) & \text{whenever } k \in \mathcal{K}_n(\theta_n). \end{cases}$$

Note that this stochastic ordering remains valid if $\hat\sigma_n^2$ is just independent from $X_n$, i.e. also
under the more general requirement of the remark at the end of this section. Via suitable
coupling of Poisson mixtures of central $\chi^2$-distributed random variables, this observation
is extended to a coupling for the whole process $(T_{jkn})_{0 \le j < k \le n}$.

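Proposition 2 (Coupling). For any $\theta_n \in \mathbb{R}^n$ there exists a probability space with random
variables $(\tilde T_{jkn})_{0 \le j < k \le n}$ and $(\tilde T^*_{jkn})_{0 \le j < k \le n}$ such that

$$\mathcal{L}\bigl((\tilde T_{jkn})_{0 \le j < k \le n}\bigr) = \mathcal{L}_{\theta_n}\bigl((T_{jkn})_{0 \le j < k \le n}\bigr), \qquad
\mathcal{L}\bigl((\tilde T^*_{jkn})_{0 \le j < k \le n}\bigr) = \mathcal{L}_{\theta_n^*}\bigl((T_{jkn})_{0 \le j < k \le n}\bigr),$$

and for arbitrary indices $0 \le j < k \le n$,

$$\tilde T_{jkn} \begin{cases} \le \tilde T^*_{jkn} & \text{whenever } j \in \mathcal{K}_n(\theta_n), \\ \ge \tilde T^*_{jkn} & \text{whenever } k \in \mathcal{K}_n(\theta_n). \end{cases}$$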



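As a consequence of Proposition 2 we can define a confidence set for $\mathcal{K}_n(\theta_n)$, based
on this least favourable case. Let $\kappa_{n,\alpha}$ denote the $(1-\alpha)$-quantile of $\mathcal{L}_{\theta_n^*}(\hat d_n)$, where for
simplicity $c := 3$ in the definition of $\hat d_n$. Note also that $\tau_n(j,k)^2 = (k-j)/n$ in case of
$\theta_n = \theta_n^*$. Motivated by the procedure in Section 2 and Theorem 1 we define

$$\hat{\mathcal{K}}_{n,\alpha} := \bigl\{j : \hat R_n(j) \le \hat R_n(k) + \hat\sigma_n^2 |k-j|\, c_{jkn} \text{ for all } k \ne j\bigr\} \tag{9}$$

$$= \bigl\{j : T_{ijn} \ge 2 - c_{ijn} \text{ for all } i < j,\ T_{jkn} \le 2 + c_{jkn} \text{ for all } k > j\bigr\}$$

with

$$c_{jkn} = c_{jkn,\alpha} := \sqrt{\frac{6}{|k-j|}}\Bigl(\Gamma\Bigl(\frac{|k-j|}{n}\Bigr) + \kappa_{n,\alpha}\Bigr) + \frac{3}{|k-j|}\,\Gamma\Bigl(\frac{|k-j|}{n}\Bigr)^2,$$

where $\Gamma(u) := (2\log(e/u))^{1/2}$, so that $\Gamma(|k-j|/n)$ equals $\Gamma_n(j,k)$ in case of $\theta_n = \theta_n^*$.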
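
The definition (9) translates directly into a quadratic-time procedure. The following Python sketch is a minimal illustration under explicit assumptions: the critical value `kappa` (playing the role of $\kappa_{n,\alpha}$) is taken as given rather than computed as a quantile under $\theta_n^*$, the constant is $c = 3$ as in the text, and all function names are hypothetical.

```python
import math


def gamma_fn(u):
    """Gamma(u) = sqrt(2 * log(e / u)) for 0 < u <= 1."""
    return math.sqrt(2.0 * math.log(math.e / u))


def critical_value(j, k, n, kappa):
    """c_jkn = sqrt(6/|k-j|) * (Gamma(|k-j|/n) + kappa) + 3 * Gamma(|k-j|/n)^2 / |k-j|."""
    d = abs(k - j)
    g = gamma_fn(d / n)
    return math.sqrt(6.0 / d) * (g + kappa) + 3.0 * g * g / d


def nested_confidence_region(x, sigma2_hat, kappa):
    """All j in {0, ..., n} with
    R_hat(j) <= R_hat(k) + sigma2_hat * |k-j| * c_jkn for every k != j."""
    n = len(x)
    # Bias-corrected risk estimates R_hat(0), ..., R_hat(n), computed backwards.
    r_hat = [0.0] * (n + 1)
    r_hat[n] = n * sigma2_hat
    tail = 0.0
    for k in range(n - 1, -1, -1):
        tail += x[k] ** 2 - sigma2_hat
        r_hat[k] = tail + k * sigma2_hat
    region = []
    for j in range(n + 1):
        if all(
            r_hat[j] <= r_hat[k] + sigma2_hat * abs(k - j) * critical_value(j, k, n, kappa)
            for k in range(n + 1)
            if k != j
        ):
            region.append(j)
    return region


if __name__ == "__main__":
    # Three very large coefficients followed by pure zeros; kappa = 1.0 is hypothetical.
    x = [100.0] * 3 + [0.0] * 5
    print(nested_confidence_region(x, sigma2_hat=1.0, kappa=1.0))
```

With such a small $n$ the critical values are large, so the region only rules out the models $k < 3$ that miss a large coefficient; the oracle inequalities below describe how the region tightens relative to the minimal risk as $n$ grows.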

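Theorem 3. Let $(\theta_n)_{n \in \mathbb{N}}$ be arbitrary. With $\hat{\mathcal{K}}_{n,\alpha}$ as defined above,

$$P_{\theta_n}\bigl(\mathcal{K}_n(\theta_n) \subset \hat{\mathcal{K}}_{n,\alpha}\bigr) \ge 1 - \alpha.$$

In case of $\beta_n \to 0$ (i.e. $n/m \to 0$), the critical values $\kappa_{n,\alpha}$ converge to the critical value $\kappa_\alpha$
introduced in Section 2. In general, $\kappa_{n,\alpha} = O(1)$, and the confidence regions $\hat{\mathcal{K}}_{n,\alpha}$ satisfy
the oracle inequalities

$$\max_{k \in \hat{\mathcal{K}}_{n,\alpha}} R_n(k) \le \min_{j \in \mathcal{C}_n} R_n(j) + \bigl(4\sqrt{3} + o_p(1)\bigr)\sqrt{\sigma^2 \log(n) \min_{j \in \mathcal{C}_n} R_n(j)} + O_p(\sigma^2 \log n) \tag{10}$$

and

$$\max_{k \in \hat{\mathcal{K}}_{n,\alpha}} L_n(k) \le \min_{j \in \mathcal{C}_n} L_n(j) + O_p\Bigl(\sqrt{\sigma^2 \log(n) \min_{j \in \mathcal{C}_n} L_n(j)}\Bigr) + O_p(\sigma^2 \log n). \tag{11}$$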



Remark (Dependence on $\alpha$) The proof reveals a refined version of the bounds in
Theorem 3 in case of signals $\theta_n$ such that

$$\log(n)^3 = O\Bigl(\min_{j \in \mathcal{C}_n} R_n(j)\Bigr).$$

Let < a(n) -> such that ^ a(n) = o(min je e n Rn(j))- Then 



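$$\max_{k \in \hat{\mathcal{K}}_{n,\alpha}} R_n(k) \le \min_{j \in \mathcal{C}_n} R_n(j) + \bigl(4\sqrt{3}\sqrt{\log n} + 2\sqrt{6}\,\kappa_{n,\alpha} + O_p(1)\bigr)\sqrt{\sigma^2 \min_{j \in \mathcal{C}_n} R_n(j)}$$

uniformly in $\alpha \ge \alpha(n)$.
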
Remark (Variance estimation) Instead of Condition (A), one may require more generally
that $\hat\sigma_n^2$ and $X_n$ are independent with

$$\sqrt{n}\,(\hat\sigma_n^2/\sigma^2 - 1) \to_{\mathcal{L}} \mathcal{N}(0, \beta^2)$$

for a given $\beta \ge 0$. This covers, for instance, estimators used in connection with wavelets.
There $\sigma$ is estimated by the median of some high frequency wavelet coefficients divided by
the normal quantile $\Phi^{-1}(3/4)$. Theorem 1 continues to hold, and the coupling extends to
this situation, too, with $S^2$ in the proof being distributed as $n\hat\sigma_n^2$. Under this assumption
on the external variance estimator, the confidence region $\hat{\mathcal{K}}_{n,\alpha}$, defined with $m := \lfloor 2n/\beta^2 \rfloor$,
is at least asymptotically valid and satisfies the above oracle inequalities as well.

4. Confidence sets in case of larger families of candidates 

The previous result relies strongly on the assumption of nested models. It is possible to 
obtain confidence sets for the optimal approximating models in a more general setting, 
albeit the resulting oracle property is not as strong as in the nested case. In particular, we 
can no longer rely on a coupling result but need a different construction. For the reader's 
convenience, we focus on the case of known $\sigma$, i.e. $m = \infty$; see also the remark at the end
of this section.

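Let $\mathcal{C}_n$ be a family of index sets $C \subset \{1, 2, \ldots, n\}$ with candidate estimators

$$\hat\theta^{(C)} := (1\{i \in C\} X_{in})_{i=1}^n$$

and corresponding risks

$$R_n(C) := R(\hat\theta^{(C)}, \theta_n) = \sum_{i \notin C} \theta_{in}^2 + |C|\sigma^2,$$

where $|S|$ denotes the cardinality of a set $S$. For two index sets $C$ and $D$,

$$\sigma^{-2}\bigl(R_n(D) - R_n(C)\bigr) = \delta_n^2(C \setminus D) - \delta_n^2(D \setminus C) + |D| - |C|$$

with the auxiliary quantities

$$\delta_n^2(J) := \sum_{i \in J} \theta_{in}^2/\sigma^2, \qquad J \subset \{1, 2, \ldots, n\}.$$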

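Hence we aim at simultaneous $(1-\alpha)$-confidence intervals for these noncentrality parameters
$\delta_n(J)$, where $J \in \mathcal{M}_n := \{D \setminus C : C, D \in \mathcal{C}_n\}$. To this end we utilize the fact
that

$$T_n(J) := \frac{1}{\sigma^2} \sum_{i \in J} X_{in}^2$$

has a $\chi^2_{|J|}(\delta_n^2(J))$-distribution. We denote the distribution function of $\chi_k^2(\delta^2)$ by $F_k(\cdot \mid \delta^2)$.
Now let $M_n := |\mathcal{M}_n| - 1 \le |\mathcal{C}_n|(|\mathcal{C}_n| - 1)$, the number of nonvoid index sets $J \in \mathcal{M}_n$. Then
with probability at least $1 - \alpha$,

$$\alpha/(2M_n) \le F_{|J|}\bigl(T_n(J) \mid \delta_n^2(J)\bigr) \le 1 - \alpha/(2M_n) \quad\text{for } \emptyset \ne J \in \mathcal{M}_n. \tag{12}$$

Since $F_{|J|}(T_n(J) \mid \delta^2)$ is strictly decreasing in $\delta^2$ with limit $0$ as $\delta^2 \to \infty$, (12) entails the
simultaneous $(1-\alpha)$-confidence intervals $[\hat\delta^2_{n,\alpha,l}(J), \hat\delta^2_{n,\alpha,u}(J)]$ for all parameters $\delta_n^2(J)$ as
follows: We set $\hat\delta^2_{n,\alpha,l}(\emptyset) := \hat\delta^2_{n,\alpha,u}(\emptyset) := 0$, while for nonvoid $J$,

$$\hat\delta^2_{n,\alpha,l}(J) := \min\bigl\{\delta^2 \ge 0 : F_{|J|}(T_n(J) \mid \delta^2) \le 1 - \alpha/(2M_n)\bigr\}, \tag{13}$$

$$\hat\delta^2_{n,\alpha,u}(J) := \max\bigl\{\delta^2 \ge 0 : F_{|J|}(T_n(J) \mid \delta^2) \ge \alpha/(2M_n)\bigr\}. \tag{14}$$

By means of these bounds, we may claim with confidence $1 - \alpha$ that for arbitrary $C, D \in \mathcal{C}_n$
the normalized difference $\sigma^{-2}(R_n(D) - R_n(C))$ is at most
$\hat\delta^2_{n,\alpha,u}(C \setminus D) - \hat\delta^2_{n,\alpha,l}(D \setminus C) + |D| - |C|$. Thus a $(1-\alpha)$-confidence set for
$\mathcal{K}_n(\theta_n) = \operatorname{argmin}_{C \in \mathcal{C}_n} R_n(C)$ is given by

$$\hat{\mathcal{K}}_{n,\alpha} := \bigl\{C \in \mathcal{C}_n : \hat\delta^2_{n,\alpha,u}(C \setminus D) - \hat\delta^2_{n,\alpha,l}(D \setminus C) + |D| - |C| \ge 0 \text{ for all } D \in \mathcal{C}_n\bigr\}.$$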

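These confidence sets $\hat{\mathcal{K}}_{n,\alpha}$ satisfy the following oracle inequalities:

Theorem 4. Let $(\theta_n)_{n \in \mathbb{N}}$ be arbitrary, and suppose that $\log|\mathcal{C}_n| = o(n)$. Then

$$\max_{C \in \hat{\mathcal{K}}_{n,\alpha}} R_n(C) \le \min_{D \in \mathcal{C}_n} R_n(D) + O_p\Bigl(\sqrt{\sigma^2 \log(|\mathcal{C}_n|) \min_{D \in \mathcal{C}_n} R_n(D)}\Bigr) + O_p(\sigma^2 \log|\mathcal{C}_n|)$$

and

$$\max_{C \in \hat{\mathcal{K}}_{n,\alpha}} L_n(C) \le \min_{D \in \mathcal{C}_n} L_n(D) + O_p\Bigl(\sqrt{\sigma^2 \log(|\mathcal{C}_n|) \min_{D \in \mathcal{C}_n} L_n(D)}\Bigr) + O_p(\sigma^2 \log|\mathcal{C}_n|).$$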

Remark. The upper bounds in Theorem 4 are of the form

$$\rho_n \Bigl(1 + O_p\Bigl(\sqrt{\sigma^2 \log(|\mathcal{C}_n|)/\rho_n} + \sigma^2 \log(|\mathcal{C}_n|)/\rho_n\Bigr)\Bigr)$$

with $\rho_n$ denoting minimal risk or minimal loss. Thus Theorem 4 entails that the maximal
risk (loss) over $\hat{\mathcal{K}}_{n,\alpha}$ exceeds the minimal risk (loss) only by a factor close to one, provided
that the minimal risk (loss) is substantially larger than $\sigma^2 \log|\mathcal{C}_n|$.

Remark (Suboptimality in case of nested models) In case of nested models,
the general construction is suboptimal in the factor of the leading (in most cases) term
$\sqrt{\sigma^2 \log(n) \min_j R_n(j)}$. Following the proof carefully and using that
$\sigma^2 \log|\mathcal{C}_n| = 2\sigma^2 \log n + O(1)$ in this special setting, one may verify that

$$\max_{k \in \hat{\mathcal{K}}_{n,\alpha}} R_n(k) \le \min_{j \in \mathcal{C}_n} R_n(j) + \bigl(4\sqrt{8} + o_p(1)\bigr)\sqrt{\sigma^2 \log(n) \min_{j \in \mathcal{C}_n} R_n(j)} + O_p(\sigma^2 \log n).$$

The intrinsic reason is that the general procedure does not assume any structure of the
family of candidate estimators. Hence advanced multiscale theory is not applicable.

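Remark. In case of unknown $\sigma$, let $\alpha' := 1 - (1-\alpha)^{1/2}$. Then with probability at least
$1 - \alpha'$,

$$\alpha'/2 \le F_m\bigl(m(\hat\sigma_n/\sigma)^2 \mid 0\bigr) \le 1 - \alpha'/2.$$

The latter inequalities entail that $(\sigma/\hat\sigma_n)^2$ lies between $\tau_{n,\alpha,l} := m/\chi^2_{m;1-\alpha'/2}$ and
$\tau_{n,\alpha,u} := m/\chi^2_{m;\alpha'/2}$. Then we obtain simultaneous $(1-\alpha)$-confidence bounds $\hat\delta^2_{n,\alpha,l}(J)$ and
$\hat\delta^2_{n,\alpha,u}(J)$ as in (13) and (14) by replacing $\alpha$ with $\alpha'$ and $T_n(J)$ with

$$\frac{\tau_{n,\alpha,l}}{\hat\sigma_n^2} \sum_{i \in J} X_{in}^2 \quad\text{and}\quad \frac{\tau_{n,\alpha,u}}{\hat\sigma_n^2} \sum_{i \in J} X_{in}^2,$$

respectively. The conclusions of Theorem 4 continue to hold, as long as $n/m_n = O(1)$.
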
5. Proofs 

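5.1. Proof of (5) and (6)

Note first that $\min_{[0,1]} Y$ lies between $F_n(s_o) + \min_{[0,1]} W$ and $F_n(s_o) + W(s_o)$. Hence for
any $\alpha' \in (0,1)$,

$$\hat S_\alpha^{\rm naive} \subset \bigl\{s \in [0,1] : F_n(s) + W(s) \le F_n(s_o) + W(s_o) + \kappa_\alpha^{\rm naive}\bigr\}$$
$$\subset \bigl\{s \in [0,1] : F_n(s) - F_n(s_o) \le \kappa_\alpha^{\rm naive} + \kappa_{\alpha'}^{\rm naive}\bigr\}$$
$$= \bigl\{s \in [0,1] : F_o(s) - F_o(s_o) \le c_n^{-1}(\kappa_\alpha^{\rm naive} + \kappa_{\alpha'}^{\rm naive})\bigr\}$$

and

$$\hat S_\alpha^{\rm naive} \supset \bigl\{s \in [0,1] : F_n(s) + W(s) \le F_n(s_o) + \min_{[0,1]} W + \kappa_\alpha^{\rm naive}\bigr\}$$
$$\supset \bigl\{s \in [0,1] : F_n(s) - F_n(s_o) \le \kappa_\alpha^{\rm naive} - \kappa_{\alpha'}^{\rm naive}\bigr\}$$
$$= \bigl\{s \in [0,1] : F_o(s) - F_o(s_o) \le c_n^{-1}(\kappa_\alpha^{\rm naive} - \kappa_{\alpha'}^{\rm naive})\bigr\}$$

with probability $1 - \alpha'$. Since $\kappa_{\alpha'}^{\rm naive} < \kappa_\alpha^{\rm naive}$ if $\alpha < \alpha' < 1$, these considerations, combined
with the expansion of $F_o$ near $s_o$, show that the maximum of $|s - s_o|$ over all $s \in \hat S_\alpha^{\rm naive}$ is
precisely of order $O_p(c_n^{-1/\gamma})$.

On the other hand, the confidence region $\hat S_\alpha$ is contained in the set of all $s \in [0,1]$ such
that

$$F_n(s) + W(s) \le F_n(s_o) + W(s_o) + \sqrt{|s - s_o|}\bigl(\sqrt{2\log(e/|s - s_o|)} + \kappa_\alpha\bigr),$$

and this entails that

$$F_o(s) - F_o(s_o) \le c_n^{-1}\sqrt{|s - s_o|}\bigl(\sqrt{2\log(e/|s - s_o|)} + \kappa_\alpha + O_p(1)\bigr)$$

with $O_p(1)$ not depending on $s$. Now the expansion of $F_o$ near $s_o$ entails claim (6). □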

5.2. Exponential inequalities 

An essential ingredient for our main results is an exponential inequality for quadratic 
functions of a Gaussian random vector. It extends inequalities of Dahlhaus and Polonik 
(2006) for quadratic forms and is of independent interest. 

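Proposition 5. Let $Z_1, \ldots, Z_n$ be independent, standard Gaussian random variables.
Furthermore, let $\lambda_1, \ldots, \lambda_n$ and $\delta_1, \ldots, \delta_n$ be real constants, and define
$\gamma^2 := \mathrm{Var}\bigl(\sum_{i=1}^n \lambda_i (Z_i + \delta_i)^2\bigr) = \sum_{i=1}^n \lambda_i^2 (2 + 4\delta_i^2)$. Then for arbitrary $\eta \ge 0$ and
$\lambda_{\max} := \max(\lambda_1, \ldots, \lambda_n, 0)$,

$$P\Bigl(\sum_{i=1}^n \lambda_i\bigl((Z_i + \delta_i)^2 - (1 + \delta_i^2)\bigr) \ge \gamma\eta\Bigr) \le \exp\Bigl(-\frac{\eta^2}{2 + 4\eta\lambda_{\max}/\gamma}\Bigr) \le e^{1/4}\exp\bigl(-\eta/\sqrt{8}\bigr).$$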
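
The two tail bounds of Proposition 5 are easy to evaluate numerically. The Python sketch below computes both for given weights and shifts and can be used to check that the second, cruder bound indeed dominates the first. The function name `prop5_bounds` and the sample inputs are illustrative assumptions; the weights must not all be zero, so that $\gamma > 0$.

```python
import math


def prop5_bounds(eta, lambdas, deltas):
    """Return (first, second), the two upper bounds of Proposition 5 for
    P(sum_i lambda_i * ((Z_i + delta_i)^2 - (1 + delta_i^2)) >= gamma * eta).
    Assumes eta >= 0 and that not all lambdas vanish (so gamma > 0)."""
    gamma = math.sqrt(sum(l * l * (2.0 + 4.0 * d * d) for l, d in zip(lambdas, deltas)))
    lam_max = max(max(lambdas), 0.0)
    first = math.exp(-eta ** 2 / (2.0 + 4.0 * eta * lam_max / gamma))
    second = math.exp(0.25) * math.exp(-eta / math.sqrt(8.0))
    return first, second


if __name__ == "__main__":
    # Hypothetical weights and noncentrality shifts.
    lambdas = [1.0, 0.5, -2.0]
    deltas = [0.0, 1.0, 3.0]
    for eta in (0.0, 1.0, 2.0, 5.0):
        first, second = prop5_bounds(eta, lambdas, deltas)
        print(f"eta = {eta:3.1f}: first = {first:.4f}, second = {second:.4f}")
```

When all weights are nonpositive, `lam_max` is zero and the first bound becomes the purely subgaussian bound $\exp(-\eta^2/2)$; a positive `lam_max` produces the heavier, exponential-type tail.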



Note that replacing $\lambda_i$ in Proposition 5 with $-\lambda_i$ yields twosided exponential inequalities.
By means of Proposition 5 and elementary calculations one obtains exponential and
related inequalities for noncentral $\chi^2$ distributions:

Corollary 6. For an integer $n > 0$ and a constant $\delta \ge 0$ let $F_n(\cdot \mid \delta^2)$ be the distribution
function of $\chi_n^2(\delta^2)$. Then for arbitrary $r \ge 0$,

$$F_n(n + \delta^2 + r \mid \delta^2) \ge 1 - \exp\Bigl(-\frac{r^2}{4n + 8\delta^2 + 4r}\Bigr), \tag{15}$$

$$F_n(n + \delta^2 - r \mid \delta^2) \le \exp\Bigl(-\frac{r^2}{4n + 8\delta^2}\Bigr). \tag{16}$$

In particular, for any $u \in (0, 1/2)$,

$$F_n^{-1}(1 - u \mid \delta^2) \le n + \delta^2 + \sqrt{(4n + 8\delta^2)\log(u^{-1})} + 4\log(u^{-1}), \tag{17}$$

$$F_n^{-1}(u \mid \delta^2) \ge n + \delta^2 - \sqrt{(4n + 8\delta^2)\log(u^{-1})}. \tag{18}$$

Moreover, for any number $\hat\delta \ge 0$, the inequalities $u \le F_n(n + \hat\delta^2 \mid \delta^2) \le 1 - u$ entail

$$\delta^2 - \hat\delta^2 \begin{cases} \le +\sqrt{(4n + 8\hat\delta^2)\log(u^{-1})} + 8\log(u^{-1}), \\ \ge -\sqrt{(4n + 8\hat\delta^2)\log(u^{-1})}. \end{cases} \tag{19}$$

Conclusion (19) follows from (15) and (16), applied to $r = \hat\delta^2 - \delta^2$ and $r = \delta^2 - \hat\delta^2$,
respectively.

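Proof of Proposition 5. Standard calculations show that for $0 \le t < (2\lambda_{\max})^{-1}$,

$$\mathbb{E}\exp\Bigl(t\sum_{i=1}^n \lambda_i (Z_i + \delta_i)^2\Bigr) = \exp\Bigl(\sum_{i=1}^n \Bigl\{\frac{t\lambda_i\delta_i^2}{1 - 2t\lambda_i} - \frac{1}{2}\log(1 - 2t\lambda_i)\Bigr\}\Bigr).$$

Then for any such $t$,

$$P\Bigl(\sum_{i=1}^n \lambda_i\bigl((Z_i + \delta_i)^2 - (1 + \delta_i^2)\bigr) \ge \gamma\eta\Bigr)
\le \exp\Bigl(-t\gamma\eta - t\sum_{i=1}^n \lambda_i(1 + \delta_i^2)\Bigr) \cdot \mathbb{E}\exp\Bigl(t\sum_{i=1}^n \lambda_i (Z_i + \delta_i)^2\Bigr)$$

$$= \exp\Bigl(-t\gamma\eta + \frac{1}{2}\sum_{i=1}^n \Bigl\{\delta_i^2\,\frac{4t^2\lambda_i^2}{1 - 2t\lambda_i} - \log(1 - 2t\lambda_i) - 2t\lambda_i\Bigr\}\Bigr). \tag{20}$$

Elementary considerations reveal that

$$-\log(1 - x) - x \le \begin{cases} x^2/2 & \text{if } x \le 0, \\ x^2/(2(1 - x)) & \text{if } x \ge 0. \end{cases}$$

Thus (20) is not greater than

$$\exp\Bigl(-t\gamma\eta + \frac{1}{2}\sum_{i=1}^n \Bigl\{\delta_i^2\,\frac{4t^2\lambda_i^2}{1 - 2t\lambda_{\max}} + \frac{2t^2\lambda_i^2}{1 - 2t\lambda_{\max}}\Bigr\}\Bigr)
= \exp\Bigl(-t\gamma\eta + \frac{\gamma^2 t^2}{2(1 - 2t\lambda_{\max})}\Bigr).$$

Setting

$$t := \frac{\eta}{\gamma + 2\eta\lambda_{\max}} \in \bigl[0, (2\lambda_{\max})^{-1}\bigr),$$

the preceding bound becomes

$$P\Bigl(\sum_{i=1}^n \lambda_i\bigl((Z_i + \delta_i)^2 - (1 + \delta_i^2)\bigr) \ge \gamma\eta\Bigr) \le \exp\Bigl(-\frac{\eta^2}{2 + 4\eta\lambda_{\max}/\gamma}\Bigr).$$

Finally, since $\gamma \ge \lambda_{\max}\sqrt{2}$, the second asserted inequality follows from

$$\frac{\eta^2}{2 + 4\eta\lambda_{\max}/\gamma} \ge \frac{\eta^2}{2 + \sqrt{8}\,\eta} = \frac{\eta}{\sqrt{8}} - \frac{\eta}{\sqrt{8} + 4\eta} \ge \frac{\eta}{\sqrt{8}} - \frac{1}{4}. \qquad \Box$$
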
5.3. Proofs of the main results 

Throughout this section we assume without loss of generality that $\sigma = 1$. Further let
$\mathcal{S}_n := \{0, 1, \ldots, n\}$ and $\mathcal{T}_n := \{(j,k) : 0 \le j < k \le n\}$.

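Proof of Theorem 1.

Step I. We first analyze $D_n$ in place of $\hat D_n$. To collect the necessary ingredients, let the
metric $\rho_n$ on $\mathcal{T}_n$ pointwise be defined by

$$\rho_n\bigl((j,k),(j',k')\bigr) := \sqrt{\tau_n(j,j')^2 + \tau_n(k,k')^2}.$$

We need bounds for the capacity numbers $D(u, \mathcal{T}', \rho_n)$ (cf. Section 6) for certain $u > 0$
and $\mathcal{T}' \subset \mathcal{T}_n$. The proof of Theorem 2.1 of Dümbgen and Spokoiny (2001) entails that

$$D\bigl(u\delta, \{t \in \mathcal{T}_n : \tau_n(t) \le \delta\}, \rho_n\bigr) \le 12 u^{-4}\delta^{-2} \quad\text{for all } u, \delta \in (0,1]. \tag{21}$$

Note that for fixed $(j,k) \in \mathcal{T}_n$, $\pm D_n(j,k)$ may be written as

$$\sum_{i=1}^n \lambda_i\bigl((\varepsilon_{in} + \theta_{in})^2 - (1 + \theta_{in}^2)\bigr)$$

with

$$\lambda_i = \lambda_{in}(j,k) := \pm(4\|\theta_n\|^2 + 2n)^{-1/2}\, 1_{(j,k]}(i),$$

so $|\lambda_i| \le (4\|\theta_n\|^2 + 2n)^{-1/2}$. Hence it follows from Proposition 5 that

$$P\bigl(|D_n(t)| \ge \tau_n(t)\eta\bigr) \le 2\exp\Bigl(-\frac{\eta^2}{2 + 4\eta(4\|\theta_n\|^2 + 2n)^{-1/2}/\tau_n(t)}\Bigr)$$

for arbitrary $t \in \mathcal{T}_n$ and $\eta \ge 0$. One may rewrite this exponential inequality as

$$P\bigl(|D_n(t)| \ge \tau_n(t)\, G_n(\eta, \tau_n(t))\bigr) \le 2\exp(-\eta) \tag{22}$$

for arbitrary $t \in \mathcal{T}_n$ and $\eta \ge 0$, where

$$G_n(\eta, \delta) := \sqrt{2\eta} + \frac{4\eta}{(4\|\theta_n\|^2 + 2n)^{1/2}\,\delta}.$$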
The second exponential inequality in Proposition 5 entails that

$$P\bigl(|D_n(t)| \ge \tau_n(t)\eta\bigr) \le 2e^{1/4}\exp\bigl(-\eta/\sqrt{8}\bigr) \tag{23}$$

and

$$P\bigl(|D_n(s) - D_n(t)| \ge \sqrt{8}\,\rho_n(s,t)\,\eta\bigr) \le 2e^{1/4}\exp(-\eta) \tag{24}$$

for arbitrary $s, t \in \mathcal{T}_n$ and $\eta \ge 0$.

Utilizing (21) and (24), it follows from Theorem 7 and the subsequent Remark 3 in
Dümbgen and Walther (2007) that

$$\lim_{\delta \downarrow 0}\, \sup_n\, P\Bigl(\sup_{s,t \in \mathcal{T}_n :\, \rho_n(s,t) \le \delta} \frac{|D_n(s) - D_n(t)|}{\rho_n(s,t)\log(e/\rho_n(s,t))} > Q\Bigr) = 0 \tag{25}$$

for a suitable constant $Q > 0$. Since $D_n(j,k) = D_n(0,k) - D_n(0,j)$ and
$\tau_n(j,k) = \rho_n((0,j),(0,k))$, this entails the stochastic equicontinuity of $D_n$ with respect to $\rho_n$.
For $0 \le \delta < \delta' \le 1$ define

$$T_n(\delta, \delta') := \sup_{t \in \mathcal{T}_n :\, \delta < \tau_n(t) \le \delta'} \Bigl(\frac{|D_n(t)|}{\tau_n(t)} - \Gamma_n(t) - \frac{c\,\Gamma_n(t)^2}{\tau_n(t)(4\|\theta_n\|^2 + 2n)^{1/2}}\Bigr)^+$$

with a constant $c > 0$ to be specified later. Recall that $\Gamma_n(t) := \bigl(2\log(e/\tau_n(t)^2)\bigr)^{1/2}$.
Starting from (21), (22) and (25), Theorem 8 of Dümbgen and Walther (2007) and its
subsequent remark imply that

$$T_n(0, \delta) \to_p 0 \quad\text{as } n \to \infty \text{ and } \delta \searrow 0, \tag{26}$$

provided that $c > 2$. On the other hand, (21), (23) and (25) entail that

$$T_n(\delta, 1) = O_p(1) \quad\text{for any fixed } \delta > 0. \tag{27}$$

Now we are ready to prove the first assertion about $\hat D_n$. Recall that $\hat D_n = \hat\sigma_n^{-2}(D_n + V_n)$
and

$$\frac{V_n(j,k)}{\tau_n(j,k)} = \frac{2\beta_n(k-j)}{\tau_n(j,k)(4\|\theta_n\|^2 + 2n)^{1/2}\sqrt{n}}\, Z_n$$

with $Z_n$ being asymptotically standard normal. Since $\tau_n(j,k) \ge \sqrt{2(k-j)}/(4\|\theta_n\|^2 + 2n)^{1/2}$,

$$\frac{|V_n(j,k)|}{\tau_n(j,k)} \le \frac{\sqrt{2(k-j)}}{\sqrt{n}}\,\beta_n|Z_n| \le \sqrt{2}\,\beta_n|Z_n|, \tag{28}$$

so the maximum of $|V_n|/\tau_n$ over $\mathcal{T}_n$ is bounded by $\sqrt{2}\,\beta_n|Z_n| = O_p(1)$. Furthermore, since
$|\mathcal{T}_n| \le n^2/2$, one can easily deduce from (23) that the maximum of $|D_n|/\tau_n$ over $\mathcal{T}_n$ exceeds
$\sqrt{32}\,\log n + \eta$ with probability at most $e^{1/4}\exp(-\eta/\sqrt{8})$. Since $\hat\sigma_n = 1 + O_p(n^{-1/2})$, these
considerations show that

$$\max_{t \in \mathcal{T}_n} \frac{|(D_n + V_n)(t)|}{\tau_n(t)} \le \sqrt{32}\,\log n + O_p(1)$$

and

$$\max_{t \in \mathcal{T}_n} \frac{|\hat D_n(t) - (D_n + V_n)(t)|}{\tau_n(t)} = O_p(n^{-1/2}\log n).$$

This proves our first assertion about $\hat D_n/\tau_n$.




Step II. Because $\hat\sigma_n^2 \to_p 1$, it is sufficient for the proof of the weak approximation

$$d_w\bigl((\hat D_n(t))_{t \in \mathcal{T}_n}, (\Delta_n(t))_{t \in \mathcal{T}_n}\bigr) \to 0 \quad\text{as } n \to \infty \tag{29}$$

to show the result for $\hat\sigma_n^2\hat D_n = D_n + V_n$ with the processes $D_n$ and $V_n$ introduced in
(7) and (8). Here, $d_w$ refers to the dual bounded Lipschitz metric which metrizes the
topology of weak convergence. Further details are provided in the appendix. Note that
$D_n(j,k) = D_n(k) - D_n(j)$ with $D_n(\ell) := D_n(0,\ell)$ and $V_n(j,k) = V_n(k) - V_n(j)$ with
$V_n(\ell) := V_n(0,\ell)$. Thus we view these processes $D_n$ and $V_n$ temporarily as processes on $\mathcal{S}_n$.
They are stochastically independent by Assumption (A). Hence, according to Lemma [H
it suffices to show that $D_n$ and $V_n$ are approximated in distribution by

$$\bigl(W(\tau_n(0,k)^2)\bigr)_{k \in \mathcal{S}_n} \quad\text{and}\quad \Bigl(\frac{2\beta_n k}{\sqrt{n}\,(4\|\theta_n\|^2 + 2n)^{1/2}}\, Z\Bigr)_{k \in \mathcal{S}_n}, \tag{30}$$

respectively. The assertion about $V_n$ is an immediate consequence of the fact that
$Z_n := \sqrt{m/2}\,(1 - \hat\sigma_n^2) = \beta_n^{-1}\sqrt{n}\,(1 - \hat\sigma_n^2)$ converges in distribution to $Z$ while
$0 \le k/[\sqrt{n}\,(4\|\theta_n\|^2 + 2n)^{1/2}] \le 1/\sqrt{2}$.

It remains to verify the assertion about $D_n$. It follows from the results in step I that
the sequence of processes $D_n$ on $\mathcal{S}_n$ is stochastically equicontinuous with respect to the
metric $\tau_n$ on $\mathcal{S}_n \times \mathcal{S}_n$. More precisely,

$$\max_{(j,k) \in \mathcal{T}_n} \frac{|D_n(k) - D_n(j)|}{\tau_n(j,k)\log(e/\tau_n(j,k)^2)} = O_p(1),$$

and it is well-known that $(W(\tau_n(0,k)^2))_{k \in \mathcal{S}_n}$ has the same property, even with the factor
$\log(e/\tau_n(j,k)^2)^{1/2}$ in place of $\log(e/\tau_n(j,k)^2)$. Moreover, both processes have independent
increments. Thus, in view of Theorem 8 in Section 6 it suffices to show that

$$\max_{(j,k) \in \mathcal{T}_n} d_w\bigl(D_n(j,k),\, W(\tau_n(0,k)^2) - W(\tau_n(0,j)^2)\bigr) \to 0. \tag{31}$$
To this end we write $D_n(j,k) = D_{n,1}(j,k) + D_{n,2}(j,k) + D_{n,3}(j,k)$ with

$$D_{n,1}(j,k) := (4\|\theta_n\|^2 + 2n)^{-1/2} \sum_{i=j+1}^k 1\{|\theta_{in}| \le \delta_n\}(2\theta_{in}\varepsilon_{in} + \varepsilon_{in}^2 - 1),$$

$$D_{n,2}(j,k) := (4\|\theta_n\|^2 + 2n)^{-1/2} \sum_{i=j+1}^k 1\{|\theta_{in}| > \delta_n\}\, 2\theta_{in}\varepsilon_{in},$$

$$D_{n,3}(j,k) := (4\|\theta_n\|^2 + 2n)^{-1/2} \sum_{i=j+1}^k 1\{|\theta_{in}| > \delta_n\}(\varepsilon_{in}^2 - 1)$$

and arbitrary numbers $\delta_n > 0$ such that $\delta_n \to \infty$ but $\delta_n/(4\|\theta_n\|^2 + 2n)^{1/2} \to 0$. These
three random variables $D_{n,s}(j,k)$ are uncorrelated and have mean zero. The number
$a_n := \#\{i : |\theta_{in}| > \delta_n\}$ satisfies the inequality $\|\theta_n\|^2 \ge a_n\delta_n^2$, whence

$$\mathbb{E}\bigl(D_{n,3}(j,k)^2\bigr) \le \frac{2a_n}{4\|\theta_n\|^2 + 2n} \le \frac{1}{2\delta_n^2} \to 0.$$

Moreover, $D_{n,1}(j,k)$ and $D_{n,2}(j,k)$ are stochastically independent, where $D_{n,1}(j,k)$ is
asymptotically Gaussian by virtue of Lindeberg's CLT, while $D_{n,2}(j,k)$ is exactly Gaussian.
These findings entail (29).

Step III. For < 5 < 5' < 1 define 

5„(<J,<5):= sup — T n (t) m , 

ter. V r n (t) (qe n \\i + 2n) 1/2 T n (t)J 

0<Tn(t)<d 

£ n (<),<5 ) := sup — — Ln{j,k) 

(j,fc)GT n : V T n (j,k) 

S<T n (j,k)<S' 

Since S„(0, 1) < T n (0, 1) + V2/3 n \Z n \, it follows from ([26]) and §27$) that 5 n (0, 1) = O p (l). 

^ icy 

As to the approximation in distribution, since r n (0, n)(4||# n || 2 + 2n) > \J2n — ► oo, 

r n (t) 2 



max 

t:T n (i)>5 



while max \T n (t)\ = 0(1) 

t:T n (t)>& 



(A\\e n f + 2n) 1/2 r n (t) 
for any fixed S € (0, 1). Consequently it follows from step II that 

d w (S n (M),£n(M)) -» (32) 

for any fixed 5 € (0, 1). Thus it suffices to show that 

S n (0, 5), S n (0, 5) — > p as n — > oo and 5 \ 0, 
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provided that ||#n|| 2 = 0(n). For £ n (0, 5) this claim follows, for instance, with the same 
arguments as (f26|h Moreover, S n (0,6) is not greater than 

Tfn «, \Vn(t)\ (4||0 w || 2 + 2n) 1/2 g 
T n (0,5)+ sup — 7— < T n (0,d)H = <5, 

teT n :r n (t)<6 T n{t) V n 

according to (28). Thus our claim follows from (26) and $\|\theta_n\|^2 = O(n)$. □

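Proof of Proposition 2. The main ingredient is a well-known representation of noncentral
$\chi^2$ distributions as Poisson mixtures of central $\chi^2$ distributions. Precisely,

\[ \chi_k^2(\delta^2) = \sum_{j=0}^{\infty} e^{-\delta^2/2} \, \frac{(\delta^2/2)^j}{j!} \, \chi_{k+2j}^2, \]

as can be proved via Laplace transforms. Now we define 'time points'

\[ t_{kn} := \sum_{i=1}^{k} \theta_{in}^2 \quad \text{and} \quad t_{kn}^* := t_{j(n)n} + k - j(n) \]

with $j(n)$ any fixed index in $\mathcal{K}_n(\theta_n)$. This construction entails that $t_{kn} \le t_{kn}^*$ with equality
if, and only if, $k \in \mathcal{K}_n(\theta_n)$.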

Figure 1 illustrates this construction. It shows the time points $t_{kn}$ (crosses) and $t_{kn}^*$
(dots and line) versus $k$ for a hypothetical signal $\theta_n \in \mathbb{R}^{40}$. Note that in this example,
$\mathcal{K}_n(\theta_n)$ is given by $\{10, 11, 20, 21\}$.
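The construction of the time points can be replayed on a small example. The following sketch is only an illustration: it assumes noise variance one, nested models indexed by $k = 0, \ldots, n$ with risk $k + \sum_{i > k} \theta_{in}^2$, and a made-up signal of length 8 (not the signal of Figure 1). The names `time_points` and `theta_sq` are hypothetical. Exact rational arithmetic is used so that ties between models are detected reliably.

```python
from fractions import Fraction


def time_points(theta_sq):
    """Compute t_k, t*_k and the risk-optimal indices for noise variance 1.

    theta_sq[i - 1] holds the squared component theta_i^2 (exact numbers).
    """
    n = len(theta_sq)
    # t[k] = theta_1^2 + ... + theta_k^2, with t[0] = 0
    t = [Fraction(0)]
    for value in theta_sq:
        t.append(t[-1] + Fraction(value))
    # Risk of the k-th nested model: k + sum_{i > k} theta_i^2
    risk = [k + (t[n] - t[k]) for k in range(n + 1)]
    rho = min(risk)
    best = [k for k in range(n + 1) if risk[k] == rho]
    j = best[0]  # any fixed risk-optimal index may be used here
    t_star = [t[j] + k - j for k in range(n + 1)]
    return t, t_star, best


# Hypothetical squared signal, chosen so that several models share the minimal risk.
theta_sq = [4, 3, 1, Fraction(1, 2), Fraction(3, 2), Fraction(1, 4), 0, 0]
t, t_star, best = time_points(theta_sq)
# Indices at which the two sequences of time points coincide.
touching = [k for k in range(len(t)) if t[k] == t_star[k]]
print(best, touching)
```

In this example the line $k \mapsto t_{kn}^*$ lies on or above the points $t_{kn}$ and touches them exactly at the risk-optimal indices.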

Let $\Pi, G_1, G_2, \ldots, G_n, Z_1, Z_2, Z_3, \ldots$ and $S^2$ be stochastically independent random
variables, where $\Pi = (\Pi(t))_{t \ge 0}$ is a standard Poisson process, $G_i$ and $Z_j$ are standard
Gaussian random variables, and $S^2 \sim \chi_m^2$. Then one can easily verify that

\begin{align*}
T_{jkn} &:= \frac{m}{S^2} \Bigl( \sum_{i=j+1}^{k} G_i^2 + \sum_{s = 2\Pi(t_{jn}/2)+1}^{2\Pi(t_{kn}/2)} Z_s^2 \Bigr), \\
T_{jkn}^* &:= \frac{m}{S^2} \Bigl( \sum_{i=j+1}^{k} G_i^2 + \sum_{s = 2\Pi(t_{jn}^*/2)+1}^{2\Pi(t_{kn}^*/2)} Z_s^2 \Bigr)
\end{align*}

define random variables $(T_{jkn})_{0 \le j < k \le n}$ and $(T_{jkn}^*)_{0 \le j < k \le n}$ with the desired properties. □
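The Poisson mixture representation used in this proof can be checked numerically through Laplace transforms. The sketch below is an illustration under stated assumptions, not part of the original argument: it uses the standard closed form $E \exp(-sY) = (1+2s)^{-k/2} \exp(-\delta^2 s/(1+2s))$ for a noncentral $\chi_k^2$ variable $Y$ with noncentrality $\delta^2$, and compares it with a truncated Poisson mixture of the central transforms $(1+2s)^{-(k+2j)/2}$. The function names and the truncation level `terms` are hypothetical choices.

```python
import math


def laplace_noncentral_chi2(s, k, delta_sq):
    """Closed-form E exp(-s*Y) for Y ~ noncentral chi^2_k(delta_sq), s >= 0."""
    return (1 + 2 * s) ** (-k / 2) * math.exp(-delta_sq * s / (1 + 2 * s))


def laplace_poisson_mixture(s, k, delta_sq, terms=200):
    """E exp(-s*Y) when Y given J is chi^2_{k+2J} and J ~ Poisson(delta_sq / 2)."""
    lam = delta_sq / 2
    total = 0.0
    weight = math.exp(-lam)  # Poisson probability of J = 0
    for j in range(terms):
        total += weight * (1 + 2 * s) ** (-(k + 2 * j) / 2)
        weight *= lam / (j + 1)  # recursion to the probability of J = j + 1
    return total


# Example parameters (arbitrary): s = 0.3, k = 5 degrees of freedom, delta^2 = 7.
closed_form = laplace_noncentral_chi2(0.3, 5, 7.0)
mixture = laplace_poisson_mixture(0.3, 5, 7.0)
print(abs(closed_form - mixture))
```

Since a distribution on $[0, \infty)$ is determined by its Laplace transform, agreement of the two functions for all $s \ge 0$ is exactly the mixture identity.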
In the proofs of Theorems 3 and 4 we utilize repeatedly two elementary inequalities:

Lemma 7. Let $a, b, c$ be nonnegative constants.

(i) Suppose that $0 \le x \le y \le x + \sqrt{b(x + y)} + c$. Then

\[ y \le x + \sqrt{2bx} + b + \sqrt{bc} + c \le x + \sqrt{2bx} + (3/2)(b + c). \]

(ii) For $x \ge 0$ define $h(x) := x + \sqrt{a + bx} + c$. Then

\[ h(h(x)) \le x + 2\sqrt{a + bx} + b/2 + \sqrt{bc} + 2c. \]
Fig 1. Construction of the a



Proof of Lemma 7. The inequality $y \le x + \sqrt{b(x + y)} + c$ entails that either $y < x + c$
or

\[ (y - x - c)^2 \le b(x + y) = 2bx + b(y - x). \]

Since $y < x + c$ is stronger than the assertions of part (i), we only consider the displayed
quadratic inequality. The latter is equivalent to

\[ \bigl(y - x - (b/2 + c)\bigr)^2 \le 2bx + (b/2 + c)^2 - c^2 = 2bx + b^2/4 + bc. \]

Hence the standard inequality $\sqrt{\sum_i z_i} \le \sum_i \sqrt{z_i}$ for nonnegative numbers $z_i$ leads to

\[ y - x \le \sqrt{2bx} + \sqrt{b^2/4} + \sqrt{bc} + b/2 + c = \sqrt{2bx} + b + \sqrt{bc} + c. \]

Finally, $0 \le (\sqrt{b} - \sqrt{c})^2$ entails that $\sqrt{bc} \le (b + c)/2$.

As to part (ii), the definition of $h(x)$ entails that

\begin{align*}
h(h(x)) &= x + \sqrt{a + bx} + \sqrt{a + bx + b\sqrt{a + bx} + bc} + 2c \\
&\le x + \sqrt{a + bx} + \sqrt{a + bx + b\sqrt{a + bx}} + \sqrt{bc} + 2c \\
&= x + \sqrt{a + bx} + \sqrt{a + bx}\,\sqrt{1 + b/\sqrt{a + bx}} + \sqrt{bc} + 2c \\
&\le x + 2\sqrt{a + bx} + b/2 + \sqrt{bc} + 2c,
\end{align*}

because $\sqrt{1 + d} \le 1 + d/2$ for arbitrary $d \ge 0$. □
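As a sanity check, both parts of Lemma 7 can be tested numerically on a grid of nonnegative constants. The sketch below is only an illustration with hypothetical helper names. For part (i) it uses the fact that the admissible values of $y$ form an interval whose right end point solves $y = x + \sqrt{b(x + y)} + c$, which gives $y = x + c + u$ with $u = \bigl(b + \sqrt{b^2 + 4b(2x + c)}\bigr)/2$; this worst case is then compared with the two upper bounds. A small tolerance absorbs floating-point rounding.

```python
import itertools
import math


def largest_y(x, b, c):
    """Largest y >= x satisfying y <= x + sqrt(b*(x + y)) + c."""
    # With u = y - x - c >= 0 the boundary case solves u^2 = b*(2x + c + u).
    u = (b + math.sqrt(b * b + 4 * b * (2 * x + c))) / 2
    return x + c + u


def check_lemma7(values=(0.0, 0.1, 1.0, 7.5, 100.0), tol=1e-9):
    """Return True if both inequalities of Lemma 7 hold on the whole grid."""
    for a, b, c, x in itertools.product(values, repeat=4):
        # Part (i): the extreme y must respect both upper bounds.
        y = largest_y(x, b, c)
        bound1 = x + math.sqrt(2 * b * x) + b + math.sqrt(b * c) + c
        bound2 = x + math.sqrt(2 * b * x) + 1.5 * (b + c)
        if y > bound1 + tol or bound1 > bound2 + tol:
            return False

        # Part (ii): iterate h twice and compare with the stated bound.
        def h(z):
            return z + math.sqrt(a + b * z) + c

        bound3 = x + 2 * math.sqrt(a + b * x) + b / 2 + math.sqrt(b * c) + 2 * c
        if h(h(x)) > bound3 + tol:
            return False
    return True


print(check_lemma7())
```

A grid check of this kind does not replace the proof; it merely guards against misreading the constants in the statement.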

Proof of Theorem 3. The definition of $\hat{\mathcal{K}}_{n,\alpha}$ and Proposition 2 together entail that $\hat{\mathcal{K}}_{n,\alpha}$
contains $\mathcal{K}_n(\theta_n)$ with probability at least $1 - \alpha$. The assertions about $\kappa_{n,\alpha}$ are immediate
consequences of Theorem 1 applied to $\theta_n = \theta_n^*$.

Now we verify the oracle inequalities (10) and (11). Let $\gamma_n := (4\|\theta_n\|^2 + 2n)^{1/2} \tau_n$.
With $\gamma_n^*$ we denote the function $\gamma_n$ on $\mathcal{T}_n$ corresponding to $\theta_n^*$. Throughout this proof
we use the shorthand notation $M_n(\ell, k) := M_n(\ell) - M_n(k)$ for $M_n = R_n, \hat{R}_n, L_n, \hat{L}_n$ and
arbitrary $\ell, k \in \mathcal{C}_n$. Furthermore, $\gamma_n^{(*)}(\ell, k) := \gamma_n^{(*)}(k, \ell)$ if $\ell > k$, and $\gamma_n^{(*)}(k, k) := 0$.

In the subsequent arguments, $k_n := \min(\mathcal{K}_n(\theta_n))$, while $j$ stands for a generic index in
$\hat{\mathcal{K}}_{n,\alpha}$. The definition of the set $\hat{\mathcal{K}}_{n,\alpha}$ entails that

\[ \hat{R}_n(j, k_n) \le \hat{\sigma}_n^2 \Bigl[ \gamma_n^*(j, k_n) \Bigl( \Gamma\Bigl(\frac{j - k_n}{n}\Bigr) + \kappa_{n,\alpha} \Bigr) + O(\log n) \Bigr]. \tag{33} \]

Here and subsequently, $O(r_n)$ and $O_p(r_n)$ denote a generic number and random variable,
respectively, depending on $n$ but neither on any other indices in $\mathcal{C}_n$ nor on $\alpha \in (0,1)$.
Precisely, in view of our remark on dependence of $\alpha$, we consider all $\alpha \ge \alpha(n)$ with $\alpha(n) > 0$
such that $\kappa_{n,\alpha(n)} = O(n^{1/6})$. Note that $\hat{\sigma}_n^2 = 1 + O_p(n^{-1/2})$. Moreover, $\gamma_n^*(j, k_n)^2 \, \Gamma((j - k_n)/n)^2$
equals $12 n x \log(e/x) \le 12n$ with $x := |j - k_n|/n \in [0,1]$. Thus we may rewrite
(33) as

\[ \hat{R}_n(j, k_n) \le \gamma_n^*(j, k_n) \Bigl( \Gamma\Bigl(\frac{j - k_n}{n}\Bigr) + \kappa_{n,\alpha} \Bigr) + O_p(\log n). \tag{34} \]

Combining this with the equation $\hat{R}_n(j, k_n) = R_n(j, k_n) - D_n(j, k_n)$ yields

\[ R_n(j, k_n) \le \gamma_n^*(j, k_n) \Bigl( \Gamma\Bigl(\frac{j - k_n}{n}\Bigr) + \kappa_{n,\alpha} \Bigr) + O_p(\log n) + |D_n(j, k_n)|. \tag{35} \]

Since $\gamma_n^*(j, k_n)^2 \le 6n$ and $\max_{t \in \mathcal{T}_n} |D_n(t)| / \gamma_n(t) = O_p(\log n)$, (35) yields

\[ R_n(j, k_n) \le \sqrt{12n} + \sqrt{6n}\, \kappa_{n,\alpha} + O_p(\log n)\, \gamma_n(j, k_n). \]

But elementary calculations yield

\[ \gamma_n(j, k_n)^2 = \gamma_n^*(j, k_n)^2 + 4\, \mathrm{sign}(k_n - j)\, R_n(j, k_n) \le 6n + 4 R_n(j, k_n). \tag{36} \]

Hence we may conclude that

\[ R_n(j, k_n) \le O_p(\log n) \sqrt{R_n(j, k_n)} + O_p\bigl(\sqrt{n}\,(\log n + \kappa_{n,\alpha})\bigr), \]

and Lemma 7 (i), applied to $x = 0$ and $y = R_n(j, k_n)$, yields

\[ \max_{j \in \hat{\mathcal{K}}_{n,\alpha}} R_n(j, k_n) \le O_p\bigl(\sqrt{n}\,(\log n + \kappa_{n,\alpha})\bigr). \tag{37} \]

This preliminary result allows us to restrict our attention to indices $j$ in a certain subset
of $\mathcal{C}_n$: Since $0 \le R_n(n, k_n) = n - k_n - \sum_{i=k_n+1}^{n} \theta_{in}^2$,

\[ \sum_{i=k_n+1}^{n} \theta_{in}^2 \le n - k_n. \]

On the other hand, in case of $j < k_n$, $R_n(j, k_n) = \sum_{i=j+1}^{k_n} \theta_{in}^2 - (k_n - j)$, so

\[ \sum_{i=j+1}^{n} \theta_{in}^2 \le n + O_p\bigl(\sqrt{n}\,(\log n + \kappa_{n,\alpha})\bigr). \]

Thus if $j_n$ denotes the smallest index $j \in \mathcal{C}_n$ such that $\sum_{i=j+1}^{n} \theta_{in}^2 \le 2n$, then $k_n \ge j_n$, and
$\hat{\mathcal{K}}_{n,\alpha} \subset \{j_n, \ldots, n\}$ with asymptotic probability one, uniformly in $\alpha \ge \alpha(n)$. This allows
us to restrict our attention to indices $j$ in $\{j_n, \ldots, n\} \cap \hat{\mathcal{K}}_{n,\alpha}$. For any $\ell \ge j_n$, $D_n(\ell, k_n)$
involves only the restricted signal vector $(\theta_{in})_{i=j_n+1}^{n}$, and the proof of Theorem 1 entails
that

\[ \max_{j_n \le \ell \le n} \Bigl( \frac{|D_n(\ell, k_n)|}{\gamma_n(\ell, k_n)} - \sqrt{2 \log n} - \frac{2c \log n}{\gamma_n(\ell, k_n)} \Bigr)^+ = O_p(1). \]

Thus we may deduce from (35) the simpler statement that with asymptotic probability
one,

\[ R_n(j, k_n) \le \bigl( \gamma_n^*(j, k_n) + \gamma_n(j, k_n) \bigr) \bigl( \sqrt{2 \log n} + \kappa_{n,\alpha} + O_p(1) \bigr) + O_p(\log n). \tag{38} \]

Now we need reasonable bounds for $\gamma_n^{(*)}(j, k_n)^2$ in terms of $R_n(j)$ and the minimal risk
$\rho_n = R_n(k_n)$, where we start from the equation in (36): If $j < k_n$, then $\gamma_n(j, k_n)^2 =
\gamma_n^*(j, k_n)^2 + 4 R_n(j, k_n)$ and $\gamma_n^*(j, k_n)^2 = 6(k_n - j) \le 6\rho_n$. If $j > k_n$, then $\gamma_n^*(j, k_n)^2 =
\gamma_n(j, k_n)^2 + 4 R_n(j, k_n)$ and

\[ \gamma_n(j, k_n)^2 = \sum_{i=k_n+1}^{j} (4\theta_{in}^2 + 2) \le 4\rho_n + 2R_n(j) = 6\rho_n + 2R_n(j, k_n). \]

Thus

\[ \gamma_n^*(j, k_n) + \gamma_n(j, k_n) \le 2\sqrt{6\rho_n} + (\sqrt{2} + \sqrt{6}) \sqrt{R_n(j, k_n)}, \]
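and inequality (38) leads to

\[ R_n(j, k_n) \le \bigl( 4\sqrt{3} \sqrt{\log n} + 2\sqrt{6}\, \kappa_{n,\alpha} + O_p(1) \bigr) \sqrt{\rho_n} + O_p\bigl( \sqrt{\log n} + \kappa_{n,\alpha} \bigr) \sqrt{R_n(j, k_n)} + O_p(\log n) \]

for all $j \in \hat{\mathcal{K}}_{n,\alpha}$. Again we may employ Lemma 7 with $x = 0$ and $y = R_n(j, k_n)$ to conclude
that

\[ \max_{j \in \hat{\mathcal{K}}_{n,\alpha}} R_n(j, k_n) \le \bigl( 4\sqrt{3} \sqrt{\log n} + 2\sqrt{6}\, \kappa_{n,\alpha} + O_p(1) \bigr) \sqrt{\rho_n} + O_p\bigl( (\log(n)^{3/4} + \kappa_{n,\alpha}^{3/2}) \rho_n^{1/4} + \log n + \kappa_{n,\alpha}^2 \bigr) \]

uniformly in $\alpha > 0$.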

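If $\log(n)^3 + \kappa_{n,\alpha(n)}^6 = O(\rho_n)$, then the previous bound for $R_n(j, k_n) = R_n(j) - \rho_n$ reads

\[ \max_{j \in \hat{\mathcal{K}}_{n,\alpha}} R_n(j) \le \rho_n + \bigl( 4\sqrt{3} \sqrt{\log n} + 2\sqrt{6}\, \kappa_{n,\alpha} + O_p(1) \bigr) \sqrt{\rho_n} \]

uniformly in $\alpha \ge \alpha(n)$. On the other hand, if we consider just a fixed $\alpha > 0$, then
$\kappa_{n,\alpha} = O(1)$, and the previous considerations yield

\begin{align*}
\max_{j \in \hat{\mathcal{K}}_{n,\alpha}} R_n(j) &\le \rho_n + (4\sqrt{3} + o_p(1)) \sqrt{\log(n) \rho_n} + O_p\bigl( \log(n)^{3/4} \rho_n^{1/4} + \log n \bigr) \\
&\le \rho_n + (4\sqrt{3} + o_p(1)) \sqrt{\log(n) \rho_n} + O_p(\log n).
\end{align*}

To verify the latter step, note that for any fixed $\epsilon > 0$,

\[ \log(n)^{3/4} \rho_n^{1/4} \le \begin{cases} \epsilon^{-1} \log n & \text{if } \rho_n \le \epsilon^{-4} \log n, \\ \epsilon \sqrt{\log(n) \rho_n} & \text{if } \rho_n \ge \epsilon^{-4} \log n. \end{cases} \]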

It remains to prove claim (11) about the losses. From now on, $j$ denotes a generic index
in $\mathcal{C}_n$. Note first that

\[ L_n(j, k_n) - R_n(j, k_n) = \sum_{i=j+1}^{k_n} (1 - \epsilon_{in}^2) = R_n(k_n, j) - L_n(k_n, j) \quad \text{if } j < k_n. \]

Thus Theorem 1, applied to $\theta_n = 0$, shows that

\[ |L_n(j, k_n) - R_n(j, k_n)| \le \gamma_n^+(j, k_n) \bigl( \sqrt{2 \log n} + O_p(1) \bigr) + O_p(\log n), \]

where

\[ \gamma_n^+(j, k_n) := \sqrt{2|k_n - j|} \le \sqrt{2\rho_n} + \sqrt{2|R_n(j, k_n)|}. \]
It follows from $L_n(0) = R_n(0) = \|\theta_n\|^2$ that $L_n(j) - \rho_n$ equals

\begin{align*}
L_n(j, k_n) + (L_n - R_n)(k_n, 0)
&= R_n(j, k_n) + O_p\bigl( \sqrt{\log(n) \rho_n} \bigr) + O_p\bigl( \sqrt{\log n} \bigr) \sqrt{R_n(j, k_n)} + O_p(\log n) \\
&\ge O_p\bigl( \sqrt{\log(n) \rho_n} + \log n \bigr),
\end{align*}

because $R_n(j, k_n) \ge 0$ and $R_n(j, k_n) + O_p(r_n) \sqrt{R_n(j, k_n)} \ge O_p(r_n^2)$. Consequently, $\hat{\rho}_n :=
\min_{j \in \mathcal{C}_n} L_n(j)$ satisfies the inequality

\[ \hat{\rho}_n \ge \rho_n + O_p\bigl( \sqrt{\log(n) \rho_n} + \log n \bigr) = (1 + o_p(1)) \rho_n + O_p(\log n), \]

and this is easily shown to entail that

\[ \rho_n \le \hat{\rho}_n + O_p\bigl( \sqrt{\log n} \bigr) \sqrt{\hat{\rho}_n} + O_p(\log n) = (1 + o_p(1)) \hat{\rho}_n + O_p(\log n). \]

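Now we restrict our attention to indices $j \in \hat{\mathcal{K}}_{n,\alpha}$ again. Here it follows from our result
about the maximal risk over $\hat{\mathcal{K}}_{n,\alpha}$ that $L_n(j) - \rho_n$ equals

\begin{align*}
& R_n(j, k_n) + O_p\bigl( \sqrt{\log(n) \rho_n} \bigr) + O_p\bigl( \sqrt{\log n} \bigr) \sqrt{R_n(j, k_n)} + O_p(\log n) \\
&\quad \le 2 R_n(j, k_n) + O_p\bigl( \sqrt{\log(n) \rho_n} + \log n \bigr) \le O_p\bigl( \sqrt{\log(n) \rho_n} + \log n \bigr).
\end{align*}

Hence $\max_{j \in \hat{\mathcal{K}}_{n,\alpha}} L_n(j)$ is not greater than

\[ \rho_n + O_p\bigl( \sqrt{\log(n) \rho_n} + \log n \bigr) \le \hat{\rho}_n + O_p\bigl( \sqrt{\log n} \bigr) \sqrt{\hat{\rho}_n} + O_p(\log n). \quad \Box \]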



Proof of Theorem 4. The application of inequality (19) in Corollary 6 to the triple
$(|J|, T_n(J) - |J|, \alpha/(2M_n))$ in place of $(n, \delta^2, \alpha)$ yields bounds for $\hat{\delta}_{n,\alpha,l}^2(J)$ and $\hat{\delta}_{n,\alpha,u}^2(J)$
in terms of $\hat{\delta}_n^2(J) := (T_n(J) - |J|)^+$. Then we apply (17)-(18) to $T_n(J)$, replacing $(n, \delta^2, u)$
with $(|J|, \delta_n^2(J), \alpha'/(2M_n))$ for any fixed $\alpha' \in (0,1)$. By means of Lemma 7 (ii) we obtain
finally

\[ \left. \begin{array}{c} \hat{\delta}_{n,\alpha,u}^2(J) - \delta_n^2(J) \\ \delta_n^2(J) - \hat{\delta}_{n,\alpha,l}^2(J) \end{array} \right\} \le (1 + o_p(1)) \sqrt{(16|J| + 32\delta_n^2(J)) \log M_n} + (K + o_p(1)) \log M_n \tag{39} \]

for all $J \in \mathcal{N}_n$. Here and throughout this proof, $K$ denotes a generic constant not depending
on $n$. Its value may be different in different expressions. It follows from the definition
of the confidence region $\hat{\mathcal{K}}_{n,\alpha}$ that for arbitrary $C \in \hat{\mathcal{K}}_{n,\alpha}$ and $D \in \mathcal{C}_n$,

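\begin{align*}
R_n(C) - R_n(D) &= \delta_n^2(D \setminus C) - \delta_n^2(C \setminus D) + |C| - |D| \\
&= (\delta_n^2 - \hat{\delta}_{n,\alpha,l}^2)(D \setminus C) + (\hat{\delta}_{n,\alpha,u}^2 - \delta_n^2)(C \setminus D) + \hat{\delta}_{n,\alpha,l}^2(D \setminus C) - \hat{\delta}_{n,\alpha,u}^2(C \setminus D) + |C| - |D| \\
&\le (\delta_n^2 - \hat{\delta}_{n,\alpha,l}^2)(D \setminus C) + (\hat{\delta}_{n,\alpha,u}^2 - \delta_n^2)(C \setminus D).
\end{align*}

Moreover, according to (39) the latter bound is not larger than

\begin{align*}
& (1 + o_p(1)) \Bigl\{ \sqrt{(16|D \setminus C| + 32\delta_n^2(D \setminus C)) \log M_n} + \sqrt{(16|C \setminus D| + 32\delta_n^2(C \setminus D)) \log M_n} \Bigr\} + (K + o_p(1)) \log M_n \\
&\quad \le (1 + o_p(1)) \sqrt{2 \bigl( 16|D| + 32\delta_n^2(C^c) + 16|C| + 32\delta_n^2(D^c) \bigr) \log M_n} + (K + o_p(1)) \log M_n \\
&\quad \le 8 \sqrt{(R_n(C) + R_n(D)) \log M_n}\, (1 + o_p(1)) + (K + o_p(1)) \log M_n.
\end{align*}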
Thus we obtain the quadratic inequality

\[ R_n(C) - R_n(D) \le 8 \sqrt{(R_n(C) + R_n(D)) \log M_n}\, (1 + o_p(1)) + (K + o_p(1)) \log M_n, \]

and with Lemma 7 this leads to

\[ R_n(C) \le R_n(D) + 8\sqrt{2} \sqrt{R_n(D) \log M_n}\, (1 + o_p(1)) + (K + o_p(1)) \log M_n. \]

This yields the assertion about the risks.

As for the losses, note that $L_n(\cdot)$ and $R_n(\cdot)$ are closely related in that

\[ (L_n - R_n)(D) = \sum_{i \in D} \epsilon_{in}^2 - |D| \]

for arbitrary $D \in \mathcal{C}_n$. Hence we may utilize (17)-(18), replacing the triple $(n, \delta^2, u)$ with
$(|D|, 0, \alpha'/(2M_n))$, to complement (39) with the following observation:

\[ -A \sqrt{|D| \log M_n} \le L_n(D) - R_n(D) \le A \sqrt{|D| \log M_n} + A \log M_n \tag{40} \]

simultaneously for all $D \in \mathcal{C}_n$ with probability tending to one as $n \to \infty$ and $A \to \infty$.



Note also that (40) implies that $R_n(D) \le A \sqrt{R_n(D) \log M_n} + L_n(D)$. Hence

\[ R_n(D) \le (3/2) \bigl( L_n(D) + A^2 \log M_n \bigr) \quad \text{for all } D \in \mathcal{C}_n, \]

by Lemma 7 (i). Assuming that both (39) and (40) hold for some large but fixed $A$, we
may conclude that for arbitrary $C \in \hat{\mathcal{K}}_{n,\alpha}$ and $D \in \mathcal{C}_n$,

\begin{align*}
L_n(C) - L_n(D)
&= (L_n - R_n)(C) - (L_n - R_n)(D) + R_n(C) - R_n(D) \\
&\le A \sqrt{2(|C| + |D|) \log M_n} + A \sqrt{2(R_n(C) + R_n(D)) \log M_n} + 2A \log M_n \\
&\le 2A \sqrt{2(R_n(C) + R_n(D)) \log M_n} + 4A \log M_n \\
&\le A' \sqrt{(L_n(C) + L_n(D)) \log M_n} + A'' \log M_n
\end{align*}

for constants $A'$ and $A''$ depending on $A$. Again this inequality entails that

\[ L_n(C) \le L_n(D) + A' \sqrt{2 L_n(D) \log M_n} + A''' \log M_n \]

for another constant $A''' = A'''(A)$. □

6. Auxiliary results 

This section collects some results from the vicinity of empirical process theory which are 
used in the present paper. 

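For any pseudo-metric space $(\mathcal{X}, d)$ and $u > 0$, we define the capacity number

\[ D(u, \mathcal{X}, d) := \max\bigl\{ |\mathcal{X}_o| : \mathcal{X}_o \subset \mathcal{X},\ d(x, y) > u \text{ for different } x, y \in \mathcal{X}_o \bigr\}. \]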
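For a small finite space the capacity number can be computed by exhaustive search. The following sketch is an illustration only: the function name `capacity_number` and the example space (six equally spaced points on the real line with the absolute difference as pseudo-metric) are assumptions made for this example. The search is exponential in the number of points and thus only meant for toy cases.

```python
from itertools import combinations


def capacity_number(points, dist, u):
    """Largest size of a subset whose distinct points are more than u apart."""
    for size in range(len(points), 0, -1):
        for subset in combinations(points, size):
            # A subset qualifies if every pair of distinct points is u-separated.
            if all(dist(p, q) > u for p, q in combinations(subset, 2)):
                return size
    return 0  # only reached for an empty space


# Hypothetical example: the points 0, 1, ..., 5 with d(p, q) = |p - q|.
grid = [0, 1, 2, 3, 4, 5]


def abs_dist(p, q):
    return abs(p - q)


print([capacity_number(grid, abs_dist, u) for u in (0.5, 1, 2, 5)])
```

As expected, the capacity number is nonincreasing in $u$: a larger separation requirement leaves room for fewer points.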

It is well-known that convergence in distribution of random variables with values in
a separable metric space may be metrized by the dual bounded Lipschitz distance. Now
we adapt the latter distance for stochastic processes. Let $\ell_\infty(\mathcal{T})$ be the space of bounded
functions $x : \mathcal{T} \to \mathbb{R}$, equipped with supremum norm $\|\cdot\|_\infty$. For two stochastic processes
$X$ and $Y$ on $\mathcal{T}$ with bounded sample paths we define

\[ d_w(X, Y) := \sup_{f \in \mathcal{H}(\mathcal{T})} \bigl| \mathbb{E}^* f(X) - \mathbb{E}^* f(Y) \bigr|, \]

where $\mathbb{P}^*$ and $\mathbb{E}^*$ denote outer probabilities and expectations, respectively, while $\mathcal{H}(\mathcal{T})$ is
the family of all functionals $f : \ell_\infty(\mathcal{T}) \to \mathbb{R}$ such that

\[ |f(x)| \le 1 \quad \text{and} \quad |f(x) - f(y)| \le \|x - y\|_\infty \quad \text{for all } x, y \in \ell_\infty(\mathcal{T}). \]

If $d$ is a pseudo-metric on $\mathcal{T}$, then the modulus of continuity $w(x, \delta \,|\, d)$ of a function
$x \in \ell_\infty(\mathcal{T})$ is defined as

\[ w(x, \delta \,|\, d) := \sup_{s, t \in \mathcal{T} :\, d(s,t) \le \delta} |x(s) - x(t)|. \]



Furthermore, $C_u(\mathcal{T}, d)$ denotes the set of uniformly continuous functions on $(\mathcal{T}, d)$, that is

\[ C_u(\mathcal{T}, d) = \bigl\{ x \in \ell_\infty(\mathcal{T}) : \lim_{\delta \downarrow 0} w(x, \delta \,|\, d) = 0 \bigr\}. \]

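Theorem 8. For $n = 1, 2, 3, \ldots$ consider stochastic processes $X_n = (X_n(t))_{t \in \mathcal{T}_n}$ and
$Y_n = (Y_n(t))_{t \in \mathcal{T}_n}$ on a metric space $(\mathcal{T}_n, \rho_n)$ with bounded sample paths. Then

\[ d_w(X_n, Y_n) \to 0 \]

provided that the following three conditions are satisfied:
(i) For arbitrary subsets $\mathcal{T}_{n,o}$ of $\mathcal{T}_n$ with $|\mathcal{T}_{n,o}| = O(1)$,

\[ d_w\bigl( X_n|_{\mathcal{T}_{n,o}},\, Y_n|_{\mathcal{T}_{n,o}} \bigr) \to 0; \]

(ii) for each number $\epsilon > 0$,

\[ \lim_{\delta \downarrow 0} \limsup_{n \to \infty} \mathbb{P}^*\bigl( w(Z_n, \delta \,|\, \rho_n) > \epsilon \bigr) = 0 \quad \text{for } Z_n = X_n, Y_n; \]

(iii) for any $\delta > 0$, $D(\delta, \mathcal{T}_n, \rho_n) = O(1)$.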

Proof. For any fixed number $\delta > 0$ let $\mathcal{T}_{n,o}$ be a maximal subset of $\mathcal{T}_n$ such that $\rho_n(s, t) >
\delta$ for different $s, t \in \mathcal{T}_{n,o}$. Then $|\mathcal{T}_{n,o}| = O(1)$ by Assumption (iii). Moreover, for any $t \in \mathcal{T}_n$
there exists a $t_o \in \mathcal{T}_{n,o}$ such that $\rho_n(t, t_o) \le \delta$. Hence there exists a partition of $\mathcal{T}_n$ into
sets $B_n(t_o)$, $t_o \in \mathcal{T}_{n,o}$, satisfying $t_o \in B_n(t_o) \subset \{t \in \mathcal{T}_n : \rho_n(t, t_o) \le \delta\}$. For any function $x$
in $\ell_\infty(\mathcal{T}_n)$ or $\ell_\infty(\mathcal{T}_{n,o})$ let $\pi_n x \in \ell_\infty(\mathcal{T}_n)$ be given by

\[ \pi_n x(t) := \sum_{t_o \in \mathcal{T}_{n,o}} 1\{t \in B_n(t_o)\}\, x(t_o). \]

Then $\pi_n x$ is linear in $x|_{\mathcal{T}_{n,o}}$ with $\|\pi_n x\|_\infty = \|x|_{\mathcal{T}_{n,o}}\|_\infty$. Moreover, any $x \in \ell_\infty(\mathcal{T}_n)$ satisfies
the inequality $\|x - \pi_n x\|_\infty \le w(x, \delta \,|\, \rho_n)$. Hence for $Z_n = X_n, Y_n$,

\begin{align*}
d_w(Z_n, \pi_n Z_n) &\le \sup_{h \in \mathcal{H}(\mathcal{T}_n)} \mathbb{E}^* |h(Z_n) - h(\pi_n Z_n)| \\
&\le \mathbb{E}^* \min\bigl( \|Z_n - \pi_n Z_n\|_\infty, 1 \bigr) \\
&\le \mathbb{E}^* \min\bigl( w(Z_n, \delta \,|\, \rho_n), 1 \bigr),
\end{align*}

and this is arbitrarily small for sufficiently small $\delta > 0$ and sufficiently large $n$, according
to Assumption (ii).
Furthermore, elementary considerations reveal that

\[ d_w(\pi_n X_n, \pi_n Y_n) = d_w\bigl( X_n|_{\mathcal{T}_{n,o}},\, Y_n|_{\mathcal{T}_{n,o}} \bigr), \]

and the latter distance converges to zero, because of $|\mathcal{T}_{n,o}| = O(1)$ and Assumption (i).
Since

\[ d_w(X_n, Y_n) \le d_w(X_n, \pi_n X_n) + d_w(Y_n, \pi_n Y_n) + d_w(\pi_n X_n, \pi_n Y_n), \]

these considerations entail the assertion that $d_w(X_n, Y_n) \to 0$. □
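The discretization step of this proof can be illustrated on a finite index set. The sketch below is not taken from the paper: it builds a maximal $\delta$-separated subset greedily (one possible way to obtain a maximal subset), assigns every point to the first center within distance $\delta$ (one possible choice of the partition), and compares the approximation error with the modulus of continuity. All function names and the example function $x(t) = t^2$ on $\{0, \ldots, 10\}$ are hypothetical.

```python
def maximal_separated_subset(points, dist, delta):
    """Greedy maximal subset whose distinct points are more than delta apart."""
    centers = []
    for t in points:
        if all(dist(t, c) > delta for c in centers):
            centers.append(t)
    return centers


def project(x, points, dist, delta):
    """Return pi x: each t receives the value of x at the first center within delta."""
    centers = maximal_separated_subset(points, dist, delta)
    return {t: x[next(c for c in centers if dist(t, c) <= delta)] for t in points}


def modulus(x, points, dist, delta):
    """w(x, delta | d): largest |x(s) - x(t)| over pairs with dist(s, t) <= delta."""
    return max(abs(x[s] - x[t]) for s in points for t in points if dist(s, t) <= delta)


def abs_distance(s, t):
    return abs(s - t)


# Hypothetical example: x(t) = t^2 on the integer grid 0, ..., 10 with delta = 2.5.
index_set = list(range(11))
x = {t: t * t for t in index_set}
centers = maximal_separated_subset(index_set, abs_distance, 2.5)
px = project(x, index_set, abs_distance, 2.5)
sup_error = max(abs(x[t] - px[t]) for t in index_set)
w = modulus(x, index_set, abs_distance, 2.5)
print(centers, sup_error, w)
```

By maximality every point has a center within distance $\delta$, so the generator inside `project` always finds one, and the supremum error cannot exceed the modulus of continuity at $\delta$.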
Finally, the next lemma provides a useful inequality for $d_w(\cdot, \cdot)$ in connection with sums
of independent processes.

Lemma 9. Let $X = X_1 + X_2$ and $Y = Y_1 + Y_2$ with independent random variables $X_1$,
$X_2$ and independent random variables $Y_1$, $Y_2$, all taking values in $(\ell_\infty(\mathcal{T}), \|\cdot\|_\infty)$. Then

\[ d_w(X, Y) \le d_w(X_1, Y_1) + d_w(X_2, Y_2). \]

For this lemma it is important that we consider random variables rather than just
stochastic processes with bounded sample paths. Note that a stochastic process on $\mathcal{T}$
is automatically a random variable with values in $(\ell_\infty(\mathcal{T}), \|\cdot\|_\infty)$ if (a) the index set
$\mathcal{T}$ is finite, or (b) the process has uniformly continuous sample paths with respect to a
pseudo-metric $d$ on $\mathcal{T}$ such that $N(u, \mathcal{T}, d) < \infty$ for all $u > 0$.

Proof of Lemma 9. Without loss of generality let the four random variables $X_1$, $X_2$,
$Y_1$ and $Y_2$ be defined on a common probability space and stochastically independent. Let
$f$ be an arbitrary functional in $\mathcal{H}(\mathcal{T})$. Then it follows from Fubini's theorem that

\begin{align*}
& \bigl| \mathbb{E} f(X_1 + X_2) - \mathbb{E} f(Y_1 + Y_2) \bigr| \\
&\quad \le \bigl| \mathbb{E} f(X_1 + X_2) - \mathbb{E} f(Y_1 + X_2) \bigr| + \bigl| \mathbb{E} f(Y_1 + X_2) - \mathbb{E} f(Y_1 + Y_2) \bigr| \\
&\quad \le \mathbb{E} \bigl| \mathbb{E}(f(X_1 + X_2) \,|\, X_2) - \mathbb{E}(f(Y_1 + X_2) \,|\, X_2) \bigr| + \mathbb{E} \bigl| \mathbb{E}(f(Y_1 + X_2) \,|\, Y_1) - \mathbb{E}(f(Y_1 + Y_2) \,|\, Y_1) \bigr| \\
&\quad \le d_w(X_1, Y_1) + d_w(X_2, Y_2).
\end{align*}

The latter inequality follows from the fact that the functionals $x \mapsto f(x + X_2)$ and $x \mapsto
f(Y_1 + x)$ belong to $\mathcal{H}(\mathcal{T})$, too. Thus $d_w(X, Y) \le d_w(X_1, Y_1) + d_w(X_2, Y_2)$. □



Acknowledgement. Constructive comments of a referee are gratefully acknowledged. 






References 

[1] Baraud, Y. (2004). Confidence balls in Gaussian regression. Ann. Statist. 32, 528-551.

[2] Beran, R. (1996). Confidence sets centered at $C_p$ estimators. Ann. Inst. Statist. Math. 48, 1-15.

[3] Beran, R. (2000). REACT scatterplot smoothers: superefficiency through basis economy. J. Amer. Statist. Assoc. 95, 155-169.

[4] Beran, R. and Dümbgen, L. (1998). Modulation of estimators and confidence sets. Ann. Statist. 26, 1826-1856.

[5] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3, 203-268.

[6] Cai, T.T. (1999). Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann. Statist. 26, 1783-1799.

[7] Cai, T.T. (2002). On block thresholding in wavelet regression: adaptivity, block size, and threshold level. Statistica Sinica 12, 1241-1273.

[8] Cai, T.T. and Low, M.G. (2006). Adaptive confidence balls. Ann. Statist. 34, 202-228.

[9] Cai, T.T. and Low, M.G. (2007). Adaptive estimation and confidence intervals for convex functions and monotone functions. Manuscript in preparation.

[10] Dahlhaus, R. and Polonik, W. (2006). Nonparametric quasi-maximum likelihood estimation for Gaussian locally stationary processes. Ann. Statist. 34, 2790-2824.

[11] Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425-455.

[12] Donoho, D.L. and Johnstone, I.M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90, 1200-1224.

[13] Donoho, D.L. and Johnstone, I.M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist. 26, 879-921.

[14] Dümbgen, L. (2002). Application of local rank tests to nonparametric regression. J. Nonpar. Statist. 14, 511-537.

[15] Dümbgen, L. (2003). Optimal confidence bands for shape-restricted curves. Bernoulli 9, 423-449.

[16] Dümbgen, L. and Spokoiny, V.G. (2001). Multiscale testing of qualitative hypotheses. Ann. Statist. 29, 124-152.

[17] Dümbgen, L. and Walther, G. (2007). Multiscale inference about a density. Technical report 56, IMSV, University of Bern.

[18] Efromovich, S. (1998). Simultaneous sharp estimation of functions and their derivatives. Ann. Statist. 26, 273-278.

[19] Futschik, A. (1999). Confidence regions for the set of global maximizers of nonparametrically estimated curves. J. Statist. Plann. Inf. 82, 237-250.

[20] Genovese, C.R. and Wasserman, L. (2005). Confidence sets for nonparametric wavelet regression. Ann. Statist. 33, 698-729.

[21] Hengartner, N.W. and Stark, P.B. (1995). Finite-sample confidence envelopes for shape-restricted densities. Ann. Statist. 23, 525-550.

[22] Hoffmann, M. and Lepski, O. (2002). Random rates in anisotropic regression (with discussion). Ann. Statist. 30, 325-396.






[23] Lepski, O.V., Mammen, E. and Spokoiny, V.G. (1997). Optimal spatial adap- 
tation to inhomogeneous smoothness: an approach based on kernel estimates with 
variable bandwidth selectors. Ann. Statist. 25, 929-947. 

[24] Li, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann. 
Statist. 17, 1001-1008. 

[25] Polyak, B.T. and Tsybakov, A.B. (1991). Asymptotic optimality of the $C_p$-test
for the orthogonal series estimation of regression. Theory Probab. Appl. 35, 293-306.

[26] Robins, J. and van der Vaart, A. (2006). Adaptive nonparametric confidence 
sets. Ann. Statist. 34, 229-253. 

[27] Stone, C.J. (1984). An asymptotically optimal window selection rule for kernel 
density estimates. Ann. Statist. 12, 1285-1297. 



